# Role: ollama

## Purpose

Installs, configures, and maintains the Ollama inference servers on the AI server host.
Two instances run simultaneously, one per NUMA node, so that both CPU sockets
of the Dell M630 (2× E5-2690v4) are used.

## Instances

| Service                | Port  | NUMA node | CPUs (physical only) | RAM binding   | Purpose        |
|------------------------|-------|-----------|----------------------|---------------|----------------|
| `ollama.service`       | 11434 | Node 1    | 1 3 5 … 27 (odd)     | `--membind=1` | General models |
| `ollama-node0.service` | 11435 | Node 0    | 0 2 4 … 26 (even)    | `--membind=0` | Coding models  |

Both instances share the same model storage directory (`/mnt/ai_data/ollama_models`)
and the same Ollama API key. Each model's weights are loaded once, into the memory of
the NUMA node whose instance serves it; they are not duplicated between instances.
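
The abbreviated CPU lists in the instance table follow a stride-2 pattern. The sketch below expands them, assuming GNU `seq` and the even/odd numbering stated in the table; `node_cpus` is a hypothetical helper, not part of the role.

```shell
#!/usr/bin/env bash
# Hypothetical helper: expand the per-node physical CPU list.
# Assumption (from the table): node 0 = even IDs 0..26, node 1 = odd IDs 1..27.
node_cpus() {
  local node="$1"
  # seq FIRST INCREMENT LAST, joined with spaces
  seq -s ' ' "$node" 2 "$((26 + node))"
}

for node in 0 1; do
  echo "node ${node}: $(node_cpus "$node")"
done
```

Each list contains 14 IDs, which matches the `OLLAMA_NUM_THREADS` value of 14. Compare the result against `lscpu -e` or `numactl --hardware` on the host before relying on it.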

## Configuration

### Node 1 — systemd override

Applied via `/etc/systemd/system/ollama.service.d/override.conf` (templated from
`templates/ollama/override.conf.j2`):

| Setting                    | Value                              | Description                               |
|----------------------------|------------------------------------|-------------------------------------------|
| `OLLAMA_API_KEY`           | (from Vault)                       | Shared key for all API requests           |
| `OLLAMA_HOST`              | `0.0.0.0:11434`                    | Listen on all interfaces, port 11434      |
| `OLLAMA_MODELS`            | `/mnt/ai_data/ollama_models`       | Shared model storage                      |
| `OLLAMA_KEEP_ALIVE`        | `-1`                               | Never unload models from RAM              |
| `OLLAMA_FLASH_ATTENTION`   | `1`                                | Fused softmax; ~20% less memory bandwidth |
| `OLLAMA_NUM_THREADS`       | `14`                               | Physical cores on NUMA node 1 only        |
| `OLLAMA_NUM_PARALLEL`      | `2`                                | Concurrent inference streams per instance |
| `OLLAMA_MAX_LOADED_MODELS` | `3`                                | 3 models warm per instance (6 total)      |
| `CPUAffinity`              | `1 3 5 … 27`                       | Odd CPUs = socket 1 physical cores        |
| `ExecStart`                | `numactl --membind=1 ollama serve` | Pin memory allocations to Node 1 RAM      |

The `OLLAMA_*` rows are environment variables passed to the service; `CPUAffinity` and
`ExecStart` are systemd unit directives.

### Node 0 — standalone systemd unit

Deployed to `/etc/systemd/system/ollama-node0.service` (from
`templates/ollama/ollama-node0.service.j2`). It uses the same settings as Node 1,
except for:

| Setting       | Value                              |
|---------------|------------------------------------|
| `OLLAMA_HOST` | `0.0.0.0:11435`                    |
| `CPUAffinity` | `0 2 4 … 26`                       |
| `ExecStart`   | `numactl --membind=0 ollama serve` |
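
One way to confirm that both units picked up their settings is to print the effective systemd properties and query each port. This is a minimal sketch, assuming it runs on the AI server host itself and that `curl` is installed; `instance_port` and `check_instance` are hypothetical helper names. If the API key is enforced in front of an instance, the request would also need the matching credentials, which are not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: inspect one Ollama unit and ping its API port.

# Map each unit named in this README to its listen port.
instance_port() {
  case "$1" in
    ollama.service)       echo 11434 ;;
    ollama-node0.service) echo 11435 ;;
    *) echo "unknown unit: $1" >&2; return 1 ;;
  esac
}

check_instance() {
  local unit="$1" port
  port="$(instance_port "$unit")" || return 1
  # CPUAffinity and ExecStart are standard systemd unit properties.
  systemctl show "$unit" -p CPUAffinity -p ExecStart
  # /api/version is a lightweight Ollama endpoint; -f fails on HTTP errors.
  curl -fsS "http://127.0.0.1:${port}/api/version"
  echo
}

# Usage on the host:
#   check_instance ollama.service
#   check_instance ollama-node0.service
```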

## NUMA Rationale

On the M630 with dual E5-2690v4:
- **Node 1** (odd CPUs) has ~120 GB of free RAM and is assigned the larger general models.
- **Node 0** (even CPUs) has ~75 GB of free RAM and is assigned the coding models.

Without `numactl --membind`, the OS may allocate model weights and KV cache across both
nodes, causing cross-socket memory traffic (~40 GB/s remote vs ~68–75 GB/s local).
`CPUAffinity` only restricts which CPUs the scheduler may use; `numactl --membind` sets
the memory policy, so both are required.
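
Memory placement can be checked per instance with `numastat`. The sketch below assumes the `numactl` package (which provides `numastat`) and `pgrep` are installed, and that weights may be held by child processes of the unit's main PID; `numa_report` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show per-NUMA-node memory usage for one instance.
numa_report() {
  local unit="$1" pid p
  pid="$(systemctl show "$unit" -p MainPID --value)"
  if [ -z "$pid" ] || [ "$pid" = "0" ]; then
    echo "$unit is not running" >&2
    return 1
  fi
  # Assumption: weights may live in child processes of the main PID,
  # so report the main process and its direct children.
  for p in "$pid" $(pgrep -P "$pid"); do
    numastat -p "$p"
  done
}

# Usage on the host:
#   numa_report ollama.service        # memory should sit on Node 1
#   numa_report ollama-node0.service  # memory should sit on Node 0
```

If a report shows a large share of memory on the other node, the `--membind` policy is probably not being applied to that unit.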

## OLLAMA_FLASH_ATTENTION

Enables a fused softmax kernel, which reduces attention memory bandwidth by ~20% and
improves throughput at all context lengths on AVX2 (E5-2690v4). Note: `OLLAMA_KV_CACHE_TYPE`
is intentionally **not** set, because q8_0 dequantization overhead regressed throughput on
this CPU despite the bandwidth savings.

## Upgrade Procedure

```bash
ansible-playbook playbooks/02_infrastructure.yml -K -e @local.yml --tags ollama
```

The official install script detects the existing installation and performs an in-place
upgrade. Both `ollama.service` and `ollama-node0.service` are restarted.
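
After an upgrade it can be useful to confirm that both units came back up. This is a sketch only, not part of the playbook; `post_upgrade_check` is a hypothetical helper that relies on `systemctl is-active` and `ollama --version`.

```shell
#!/usr/bin/env bash
# Hypothetical post-upgrade check: both units active, binary reports a version.
post_upgrade_check() {
  local unit rc=0
  for unit in ollama.service ollama-node0.service; do
    if systemctl is-active --quiet "$unit"; then
      echo "$unit: active"
    else
      echo "$unit: NOT active" >&2
      rc=1
    fi
  done
  # Print the installed version; count a failure here as a failed check.
  ollama --version || rc=1
  return "$rc"
}

# Usage on the host:
#   post_upgrade_check && echo "upgrade looks healthy"
```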

## Tags

To run only the tasks tagged `ollama` from the full site playbook:

```bash
ansible-playbook playbooks/site.yml --tags ollama -K -e @local.yml
```