# 21 — Self-Host Migration (Ollama → vLLM on Kubernetes)

## Context
Fogbreak started with Ollama for local AI inference (instruction 03). As the platform scales to multiple brokerages and markets, migrate to production-grade vLLM on Kubernetes for reliability, throughput, and multi-model serving.

## When to Migrate
- More than 1,000 daily AI requests
- Multiple tenants actively using AI features
- Need for GPU redundancy (single-GPU Ollama is a single point of failure)
- Fine-tuned models ready for deployment

## What to Build

### 1. vLLM Setup
```bash
# Install vLLM
pip install vllm

# Serve model
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.3-70B \
    --tensor-parallel-size 2 \
    --max-model-len 8192 \
    --port 8000
```

### 2. Kubernetes Deployment
```yaml
# k8s/vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama
spec:
  replicas: 2
  template:
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          resources:
            limits:
              nvidia.com/gpu: 1
          ports:
            - containerPort: 8000
          env:
            - name: MODEL
              value: "meta-llama/Llama-3.3-70B"
---
# Separate deployment for fast model
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-mistral
spec:
  replicas: 2
  template:
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          resources:
            limits:
              nvidia.com/gpu: 1
          ports:
            - containerPort: 8001
          env:
            - name: MODEL
              value: "mistralai/Mistral-7B-Instruct-v0.3"
```

### 3. Update AI Proxy
Update `ai/models/ollama_client.py` → `ai/models/vllm_client.py`:
- vLLM serves OpenAI-compatible API
- Switch client from Ollama HTTP to OpenAI-compatible endpoint
- Model router unchanged (same routing logic)
- Health checks against vLLM endpoints

### 4. Fine-Tuning Pipeline
Once you have enough data:
- Export successful listing descriptions (agent-approved)
- Export successful email campaigns (high open/click rates)
- Export successful social posts (high engagement)
- Fine-tune Mistral 7B on this data
- Deploy fine-tuned model alongside base models
- A/B test fine-tuned vs. base

### 5. Monitoring
- Prometheus metrics: request latency, queue depth, GPU utilization
- Grafana dashboards for inference performance
- Alerts: latency > 5s, GPU OOM, model crash
- Request logging for quality monitoring

### 6. Multi-Region (Nationwide)
When serving brokerages across the US:
- East Coast cluster (Virginia)
- Central cluster (Iowa)
- West Coast cluster (Oregon)
- Route requests to nearest cluster
- Shared model weights, independent inference

## Migration Checklist
- [ ] vLLM serving both models (Llama 3.3 + Mistral)
- [ ] Kubernetes deployment with GPU scheduling
- [ ] FastAPI proxy updated to call vLLM instead of Ollama
- [ ] Zero downtime migration (run both during transition)
- [ ] Health checks and auto-restart
- [ ] Monitoring with Prometheus/Grafana
- [ ] Fine-tuning pipeline with agent-approved data
- [ ] Multi-region deployment plan documented
