Install, configure, and maintain the Ollama inference servers on the AI server host. Two instances run simultaneously, one per NUMA socket, to use both CPU sockets on the Dell M630 (2× E5-2690v4).
| Service | Port | NUMA Node | CPUs (physical only) | RAM binding | Purpose |
|---|---|---|---|---|---|
| ollama.service | 11434 | Node 1 | 1 3 5 … 27 (odd) | --membind=1 | General models |
| ollama-node0.service | 11435 | Node 0 | 0 2 4 … 26 (even) | --membind=0 | Coding models |
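The odd/even CPU lists in the table follow from interleaved core numbering. As a minimal sketch, such a list could be generated instead of typed by hand. The helper name `cpu_list` is hypothetical, and the layout (even ids on node 0, odd ids on node 1, 14 physical cores per socket) is an assumption that should be confirmed with `numactl --hardware` or `lscpu` on the host.

```shell
#!/bin/sh
# Hypothetical helper: print COUNT CPU ids starting at FIRST, stepping by 2.
# Assumes interleaved numbering (even ids on node 0, odd ids on node 1);
# confirm the real layout with `numactl --hardware` or `lscpu` first.
cpu_list() {
  first=$1
  count=$2
  out=""
  i=0
  while [ "$i" -lt "$count" ]; do
    out="${out:+$out }$(( first + 2 * i ))"
    i=$(( i + 1 ))
  done
  printf '%s\n' "$out"
}

CORES_PER_SOCKET=14   # E5-2690v4: 14 physical cores per socket

echo "node 0: $(cpu_list 0 "$CORES_PER_SOCKET")"
echo "node 1: $(cpu_list 1 "$CORES_PER_SOCKET")"
```

With 14 cores per socket the lists end at 26 and 27, matching the table.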
Both instances share the same model storage directory (/mnt/ai_data/ollama_models)
and Ollama API key. Weights are loaded once into the NUMA node's memory; they are not
duplicated between instances.
Settings for ollama.service are applied via /etc/systemd/system/ollama.service.d/override.conf (templated from
templates/ollama/override.conf.j2):
| Variable | Value | Description |
|---|---|---|
| OLLAMA_API_KEY | (from Vault) | Shared key for all API requests |
| OLLAMA_HOST | 0.0.0.0:11434 | Listen on all interfaces, port 11434 |
| OLLAMA_MODELS | /mnt/ai_data/ollama_models | Shared model storage |
| OLLAMA_KEEP_ALIVE | -1 | Never unload models from RAM |
| OLLAMA_FLASH_ATTENTION | 1 | Fused softmax, ~20% less memory bandwidth |
| OLLAMA_NUM_THREADS | 14 | Physical cores on NUMA node 1 only |
| OLLAMA_NUM_PARALLEL | 2 | Concurrent inference streams per instance |
| OLLAMA_MAX_LOADED_MODELS | 3 | 3 models warm per instance (6 total) |
| CPUAffinity | 1 3 5 … 27 | Odd CPUs = socket 1 physical cores |
| ExecStart | numactl --membind=1 ollama serve | Pin memory allocations to Node 1 RAM |
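To illustrate how these settings might look as a systemd drop-in, the sketch below prints one from a shell function. It is not the contents of override.conf.j2: the function name `render_override` is hypothetical, the binary paths /usr/bin/numactl and /usr/local/bin/ollama are assumptions, and the API key is left out because it comes from Vault.

```shell
#!/bin/sh
# Hypothetical sketch, not the contents of override.conf.j2.
# Usage: render_override NODE PORT "CPU LIST"
render_override() {
  node=$1
  port=$2
  cpus=$3
  cat <<EOF
[Service]
# OLLAMA_API_KEY is injected from Vault and deliberately omitted here.
Environment="OLLAMA_HOST=0.0.0.0:${port}"
Environment="OLLAMA_MODELS=/mnt/ai_data/ollama_models"
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_NUM_THREADS=14"
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
CPUAffinity=${cpus}
# An empty ExecStart= clears the packaged command before replacing it.
ExecStart=
# Binary paths are assumptions; check with: command -v numactl ollama
ExecStart=/usr/bin/numactl --membind=${node} /usr/local/bin/ollama serve
EOF
}

# Print a drop-in for the Node 1 instance (CPU list shortened for the demo).
render_override 1 11434 "1 3 5"
```

The empty `ExecStart=` line matters in a drop-in: without it systemd would see two start commands for the service instead of replacing the original one.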
The Node 0 instance is deployed to /etc/systemd/system/ollama-node0.service (from
templates/ollama/ollama-node0.service.j2). It uses the same variables, except:
| Variable | Value |
|---|---|
| OLLAMA_HOST | 0.0.0.0:11435 |
| CPUAffinity | 0 2 4 … 26 |
| ExecStart | numactl --membind=0 ollama serve |
On the M630 with dual E5-2690v4, each socket is its own NUMA node with local RAM.
Without numactl --membind, the OS may place model weights and KV cache on either
node, so part of the memory traffic crosses sockets (~40 GB/s remote vs ~68–75 GB/s local).
CPUAffinity only restricts which CPUs the scheduler may use; numactl sets the memory
allocation policy, so both are required.
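The difference between scheduler affinity and memory policy can be checked on a running instance. The sketch below is a rough illustration, assuming a Linux host with NUMA support in /proc and enough privileges (typically root) to read another process's numa_maps. All function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting a running instance.
# The optional second argument overrides the /proc root (useful for testing).

# CPUs the process may run on, as set by CPUAffinity.
cpus_allowed() {
  awk '/^Cpus_allowed_list:/ { print $2 }' "${2:-/proc}/$1/status"
}

# Count mappings per memory policy. The second field of each numa_maps
# line is the policy, e.g. "bind:1" under numactl --membind=1.
mem_policies() {
  awk '{ print $2 }' "${2:-/proc}/$1/numa_maps" | sort | uniq -c
}

# Report both for a systemd unit, e.g.: check_unit ollama.service
check_unit() {
  pid=$(systemctl show -p MainPID --value "$1")
  echo "$1 (pid $pid)"
  echo "  cpus: $(cpus_allowed "$pid")"
  mem_policies "$pid" | sed 's/^/  policy: /'
}
```

If only CPUAffinity were set, the CPU list would be restricted but the policy column would show `default` rather than `bind:N`. Child processes inherit both settings, so the same check applies to any worker process the server spawns.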
OLLAMA_FLASH_ATTENTION=1 enables the fused softmax kernel, which reduces attention
memory bandwidth by ~20% and improves throughput at all context lengths on AVX2
(E5-2690v4). Note: OLLAMA_KV_CACHE_TYPE is intentionally not set, because q8_0
dequantization overhead regressed throughput on this CPU despite the bandwidth savings.
ansible-playbook playbooks/02_infrastructure.yml -K -e @local.yml --tags ollama
The official install script detects the existing installation and performs an in-place
upgrade. Both ollama.service and ollama-node0.service are restarted.
ansible-playbook playbooks/site.yml --tags ollama -K -e @local.yml
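Since both services are restarted during an upgrade, a quick check that each instance answers again can be useful. This is a minimal sketch, assuming `curl` is installed and that the `/api/version` endpoint is reachable from localhost without credentials. How the shared API key is presented to the server is not specified here, so no authentication header is shown. The function name `check_instances` is hypothetical.

```shell
#!/bin/sh
# Hypothetical post-upgrade check: query /api/version on each given port.
# Returns non-zero if any instance fails to answer within 5 seconds.
check_instances() {
  failed=0
  for port in "$@"; do
    if version=$(curl -fsS --max-time 5 "http://127.0.0.1:${port}/api/version"); then
      echo "port ${port}: ${version}"
    else
      echo "port ${port}: no response" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Usage: check_instances 11434 11435
```

`systemctl is-active ollama.service ollama-node0.service` gives a complementary view from the service manager's side.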