Role: ollama

Purpose

Install, configure, and maintain the Ollama inference servers on the AI server host. Two instances run simultaneously, one per NUMA socket, to utilize both CPUs of the Dell M630 (2× E5-2690v4).

Instances

| Service | Port | NUMA node | CPUs (physical only) | RAM binding | Purpose |
|---|---|---|---|---|---|
| `ollama.service` | 11434 | Node 1 | 1 3 5 … 27 (odd) | `--membind=1` | General models |
| `ollama-node0.service` | 11435 | Node 0 | 0 2 4 … 26 (even) | `--membind=0` | Coding models |

Both instances share the same model storage directory (/mnt/ai_data/ollama_models) and Ollama API key. Weights are loaded once into the NUMA node's memory; they are not duplicated between instances.
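Because each model tier lives on a fixed port, client-side routing reduces to a port lookup. A minimal shell sketch; the Bearer-style header is an assumption about how the shared `OLLAMA_API_KEY` is enforced, and the model name is a placeholder:

```shell
# Pick the instance port by model tier (ports from the table above).
pick_port() {
  case "$1" in
    coding)  echo 11435 ;;  # ollama-node0.service, socket 0
    *)       echo 11434 ;;  # ollama.service, socket 1 (default)
  esac
}

PORT=$(pick_port coding)
echo "http://localhost:${PORT}/api/generate"

# Against a running instance (model name is hypothetical):
# curl -s "http://localhost:${PORT}/api/generate" \
#   -H "Authorization: Bearer ${OLLAMA_API_KEY}" \
#   -d '{"model": "qwen2.5-coder", "prompt": "hi", "stream": false}'
```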

Configuration

Node 1 — systemd override

Applied via /etc/systemd/system/ollama.service.d/override.conf (templated from templates/ollama/override.conf.j2):

| Setting | Value | Description |
|---|---|---|
| `OLLAMA_API_KEY` | (from Vault) | Shared key for all API requests |
| `OLLAMA_HOST` | `0.0.0.0:11434` | Listen on all interfaces, port 11434 |
| `OLLAMA_MODELS` | `/mnt/ai_data/ollama_models` | Shared model storage |
| `OLLAMA_KEEP_ALIVE` | `-1` | Never unload models from RAM |
| `OLLAMA_FLASH_ATTENTION` | `1` | Fused attention kernel, ~20% less memory bandwidth |
| `OLLAMA_NUM_THREADS` | `14` | Physical cores on NUMA node 1 only |
| `OLLAMA_NUM_PARALLEL` | `2` | Concurrent inference streams per instance |
| `OLLAMA_MAX_LOADED_MODELS` | `3` | Three models kept warm per instance (6 total) |
| `CPUAffinity` | `1 3 5 … 27` | Odd CPUs = socket 1 physical cores |
| `ExecStart` | `numactl --membind=1 ollama serve` | Pin memory allocations to Node 1 RAM |
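Put together, the rendered drop-in looks roughly like this sketch (the Vault-sourced key is elided, the binary paths are assumptions, and the CPU list is the odd set written out in full; the empty `ExecStart=` line is required in a drop-in to clear the inherited value before replacing it):

```ini
# /etc/systemd/system/ollama.service.d/override.conf (sketch)
[Service]
Environment="OLLAMA_API_KEY=<from Vault>"
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/mnt/ai_data/ollama_models"
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_NUM_THREADS=14"
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
CPUAffinity=1 3 5 7 9 11 13 15 17 19 21 23 25 27
ExecStart=
ExecStart=/usr/bin/numactl --membind=1 /usr/local/bin/ollama serve
```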

Node 0 — standalone systemd unit

Deployed to /etc/systemd/system/ollama-node0.service (from templates/ollama/ollama-node0.service.j2). Uses the same variables but with:

| Setting | Value |
|---|---|
| `OLLAMA_HOST` | `0.0.0.0:11435` |
| `CPUAffinity` | `0 2 4 … 26` |
| `ExecStart` | `numactl --membind=0 ollama serve` |
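As a standalone unit rather than a drop-in, the Node 0 file also needs its own `[Unit]` and `[Install]` sections. A sketch under the same assumptions as above (binary paths, `Description`, and `Restart` policy are illustrative, not the role's actual template):

```ini
# /etc/systemd/system/ollama-node0.service (sketch)
[Unit]
Description=Ollama inference server (NUMA node 0, coding models)
After=network-online.target

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11435"
Environment="OLLAMA_MODELS=/mnt/ai_data/ollama_models"
# ...remaining OLLAMA_* settings as in the Node 1 table...
CPUAffinity=0 2 4 6 8 10 12 14 16 18 20 22 24 26
ExecStart=/usr/bin/numactl --membind=0 /usr/local/bin/ollama serve
Restart=always

[Install]
WantedBy=multi-user.target
```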

NUMA Rationale

On the M630 with dual E5-2690v4:

  • Node 1 (odd CPUs) has ~120 GB free RAM — assigned general models (larger)
  • Node 0 (even CPUs) has ~75 GB free RAM — assigned coding models

Without numactl --membind, the OS allocates model weights and KV cache across both nodes, causing cross-socket memory traffic (~40 GB/s vs ~68–75 GB/s local). CPUAffinity alone only constrains where threads are scheduled; numactl --membind additionally pins memory allocations to the local node.
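The two CPU lists follow directly from the interleaved core numbering on this box (28 physical cores, odd = socket 1, even = socket 0), and the binding can be spot-checked at runtime. A sketch; the verification commands are left commented because they need `numactl`/`numastat` installed and a running instance:

```shell
# Socket 1 (ollama.service) gets the odd physical cores, socket 0 the even ones.
NODE1_CPUS=$(seq 1 2 27 | tr '\n' ' ')
NODE0_CPUS=$(seq 0 2 26 | tr '\n' ' ')
echo "node1: ${NODE1_CPUS}"
echo "node0: ${NODE0_CPUS}"

# Spot-check on the host:
# numactl --hardware                  # node sizes and inter-node distances
# taskset -cp "$(pgrep -ox ollama)"   # effective CPU affinity of the process
# numastat -p "$(pgrep -ox ollama)"   # per-node memory usage of the process
```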

OLLAMA_FLASH_ATTENTION

Enables the fused flash-attention kernel, which reduces attention memory bandwidth by ~20% and improves throughput at all context lengths on AVX2 (E5-2690v4). Note: OLLAMA_KV_CACHE_TYPE is intentionally left unset; q8_0 KV-cache dequantization overhead regressed throughput on this CPU despite the bandwidth savings.

Upgrade Procedure

ansible-playbook playbooks/02_infrastructure.yml -K -e @local.yml --tags ollama

The official install script detects the existing installation and performs an in-place upgrade. Both ollama.service and ollama-node0.service are restarted.
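The restart can be driven by a handler along these lines (a sketch; the task name and exact module options are assumptions, not necessarily this role's actual handlers):

```yaml
# Hypothetical handler: restart both instances after an upgrade.
- name: Restart Ollama instances
  ansible.builtin.systemd:
    name: "{{ item }}"
    state: restarted
    daemon_reload: true
  loop:
    - ollama.service
    - ollama-node0.service
```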

Tags

ansible-playbook playbooks/site.yml --tags ollama -K -e @local.yml