Role: ollama

Purpose

Install, configure, and maintain the Ollama inference servers on the AI server host. Two instances run simultaneously, one per NUMA socket, to utilize both CPUs of the Dell M630 (2× E5-2690v4).

Instances

| Service | Port | NUMA node | CPUs (physical only) | RAM binding | Purpose |
|---|---|---|---|---|---|
| `ollama.service` | 11434 | Node 1 | 1 3 5 … 27 (odd) | `--membind=1` | General models |
| `ollama-node0.service` | 11435 | Node 0 | 0 2 4 … 26 (even) | `--membind=0` | Coding models |

Both instances share the same model storage directory (/mnt/ai_data/ollama_models) and Ollama API key. Weights are loaded once into the NUMA node's memory; they are not duplicated between instances.
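Because each model tier lives on a fixed port, client-side routing reduces to a port lookup. A minimal shell sketch; the Bearer-style header is an assumption about how the shared `OLLAMA_API_KEY` is enforced, and the model name is a placeholder:

```shell
# Pick the instance port by model tier (ports from the table above).
pick_port() {
  case "$1" in
    coding)  echo 11435 ;;  # ollama-node0.service, socket 0
    *)       echo 11434 ;;  # ollama.service, socket 1 (default)
  esac
}

PORT=$(pick_port coding)
echo "http://localhost:${PORT}/api/generate"

# Against a running instance (model name is hypothetical):
# curl -s "http://localhost:${PORT}/api/generate" \
#   -H "Authorization: Bearer ${OLLAMA_API_KEY}" \
#   -d '{"model": "qwen2.5-coder", "prompt": "hi", "stream": false}'
```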

Configuration

Node 1 — systemd override

Applied via /etc/systemd/system/ollama.service.d/override.conf (templated from templates/ollama/override.conf.j2):

| Setting | Value | Description |
|---|---|---|
| `OLLAMA_API_KEY` | (from Vault) | Shared key for all API requests |
| `OLLAMA_HOST` | `0.0.0.0:11434` | Listen on all interfaces, port 11434 |
| `OLLAMA_MODELS` | `/mnt/ai_data/ollama_models` | Shared model storage |
| `OLLAMA_KEEP_ALIVE` | `-1` | Never unload models from RAM |
| `OLLAMA_FLASH_ATTENTION` | `1` | Fused attention kernel, ~20% less memory bandwidth |
| `OLLAMA_NUM_THREADS` | `14` | Physical cores on NUMA node 1 only |
| `OLLAMA_NUM_PARALLEL` | `2` | Concurrent inference streams per instance |
| `OLLAMA_MAX_LOADED_MODELS` | `3` | Three models kept warm per instance (6 total) |
| `CPUAffinity` | `1 3 5 … 27` | Odd CPUs = socket 1 physical cores |
| `ExecStart` | `numactl --membind=1 ollama serve` | Pin memory allocations to Node 1 RAM |
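Put together, the rendered drop-in looks roughly like this sketch (the Vault-sourced key is elided, the binary paths are assumptions, and the CPU list is the odd set written out in full; the empty `ExecStart=` line is required in a drop-in to clear the inherited value before replacing it):

```ini
# /etc/systemd/system/ollama.service.d/override.conf (sketch)
[Service]
Environment="OLLAMA_API_KEY=<from Vault>"
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/mnt/ai_data/ollama_models"
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_NUM_THREADS=14"
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
CPUAffinity=1 3 5 7 9 11 13 15 17 19 21 23 25 27
ExecStart=
ExecStart=/usr/bin/numactl --membind=1 /usr/local/bin/ollama serve
```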

Node 0 — standalone systemd unit

Deployed to /etc/systemd/system/ollama-node0.service (from templates/ollama/ollama-node0.service.j2). Uses the same variables but with:

| Setting | Value |
|---|---|
| `OLLAMA_HOST` | `0.0.0.0:11435` |
| `CPUAffinity` | `0 2 4 … 26` |
| `ExecStart` | `numactl --membind=0 ollama serve` |
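As a standalone unit rather than a drop-in, the Node 0 file also needs its own `[Unit]` and `[Install]` sections. A sketch under the same assumptions as above (binary paths, `Description`, and `Restart` policy are illustrative, not the role's actual template):

```ini
# /etc/systemd/system/ollama-node0.service (sketch)
[Unit]
Description=Ollama inference server (NUMA node 0, coding models)
After=network-online.target

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11435"
Environment="OLLAMA_MODELS=/mnt/ai_data/ollama_models"
# ...remaining OLLAMA_* settings as in the Node 1 table...
CPUAffinity=0 2 4 6 8 10 12 14 16 18 20 22 24 26
ExecStart=/usr/bin/numactl --membind=0 /usr/local/bin/ollama serve
Restart=always

[Install]
WantedBy=multi-user.target
```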

NUMA Rationale

On the M630 with dual E5-2690v4:

  • Node 1 (odd CPUs) has ~120 GB free RAM — assigned general models (larger)
  • Node 0 (even CPUs) has ~75 GB free RAM — assigned coding models

Without numactl --membind, the OS allocates model weights and KV cache across both nodes, causing cross-socket memory traffic (~40 GB/s vs ~68–75 GB/s local). CPUAffinity alone only constrains where threads are scheduled; numactl --membind additionally pins memory allocations to the local node.
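The two CPU lists follow directly from the interleaved core numbering on this box (28 physical cores, odd = socket 1, even = socket 0), and the binding can be spot-checked at runtime. A sketch; the verification commands are left commented because they need `numactl`/`numastat` installed and a running instance:

```shell
# Socket 1 (ollama.service) gets the odd physical cores, socket 0 the even ones.
NODE1_CPUS=$(seq 1 2 27 | tr '\n' ' ')
NODE0_CPUS=$(seq 0 2 26 | tr '\n' ' ')
echo "node1: ${NODE1_CPUS}"
echo "node0: ${NODE0_CPUS}"

# Spot-check on the host:
# numactl --hardware                  # node sizes and inter-node distances
# taskset -cp "$(pgrep -ox ollama)"   # effective CPU affinity of the process
# numastat -p "$(pgrep -ox ollama)"   # per-node memory usage of the process
```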

OLLAMA_FLASH_ATTENTION

Enables the fused flash-attention kernel, which reduces attention memory bandwidth by ~20% and improves throughput at all context lengths on AVX2 (E5-2690v4). Note: OLLAMA_KV_CACHE_TYPE is intentionally left unset; q8_0 KV-cache dequantization overhead regressed throughput on this CPU despite the bandwidth savings.

Upgrade Procedure

ansible-playbook playbooks/02_infrastructure.yml -K -e @local.yml --tags ollama

The official install script detects the existing installation and performs an in-place upgrade. Both ollama.service and ollama-node0.service are restarted.
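The restart can be driven by a handler along these lines (a sketch; the task name and exact module options are assumptions, not necessarily this role's actual handlers):

```yaml
# Hypothetical handler: restart both instances after an upgrade.
- name: Restart Ollama instances
  ansible.builtin.systemd:
    name: "{{ item }}"
    state: restarted
    daemon_reload: true
  loop:
    - ollama.service
    - ollama-node0.service
```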

Tags

ansible-playbook playbooks/site.yml --tags ollama -K -e @local.yml