README.md 4.0 KB

Role: models

Purpose

Manage the Ollama model lifecycle — pulling models, creating custom Modelfile configurations, and running warm-up services to ensure models are loaded into RAM at boot time across both NUMA instances.

6-Slot System

Slot Instance Port Role Selection Rotation
1 Node 1 11434 General (locked) Top general composite Re-benchmark only
2 Node 1 11434 General (locked) 2nd general composite Re-benchmark only
5 Node 1 11434 General (rotate) 3rd general composite -e slot5_model=<name>
3 Node 0 11435 Coding (locked) Top coding composite Re-benchmark only
4 Node 0 11435 Coding (locked) 2nd coding composite Re-benchmark only
6 Node 0 11435 Coding (rotate) 3rd coding composite -e slot6_model=<name>

Slot Rotation

Rotate the general slot on Node 1 (port 11434):

ansible-playbook playbooks/04_models.yml -K -e @local.yml -e "slot5_model=mistral:latest"

Rotate the coding slot on Node 0 (port 11435):

ansible-playbook playbooks/04_models.yml -K -e @local.yml -e "slot6_model=llama3.1:70b"

Both at once:

ansible-playbook playbooks/04_models.yml -K -e @local.yml \
  -e "slot5_model=mistral:latest" -e "slot6_model=command-r:35b"

Reset both rotate slots back to benchmark recommendations:

ansible-playbook playbooks/04_models.yml -K -e @local.yml

Modelfile Configurations

Custom Modelfile variants are created for fine-tuned context windows:

Custom Model Base Slot Context Port Use Case
coder-128k slot3_coding 32768 11435 Primary coding (large context)
coder-32k slot4_coding 32768 11435 Secondary coding
coder-rotate slot6_coding_rotate 32768 11435 Rotatable coding model
llama-family llama3.2:3b 8192 11434 Family-safe general assistant
gemma-family llama3.1:8b 8192 11434 Family-safe general assistant

These aliases are excluded from benchmarking via benchmark_skip_aliases — their 32k-token parameter allocations stall the benchmark loop with 285-second responses.

Warm-up Services

Two oneshot systemd services pre-load models after their respective Ollama instances start:

Service Warms Instance
ollama-warmup.service slots 1, 2, 5 Node 1 (port 11434)
ollama-warmup-node0.service slots 3, 4, 6 Node 0 (port 11435)

OLLAMA_KEEP_ALIVE=-1 keeps models pinned once loaded. The warmup services only need to run once after boot; subsequent requests hit already-loaded models immediately.

Check warmup status:

systemctl status ollama-warmup ollama-warmup-node0

Re-run warmup manually (e.g. after rotating a slot):

systemctl restart ollama-warmup          # Node 1 general models
systemctl restart ollama-warmup-node0    # Node 0 coding models

model_selection.json

playbooks/04_models.yml reads benchmarks/results/model_selection.json:

{
  "slot1_general": "llama3.1:8b",
  "slot2_general": "mistral:latest",
  "slot5_general_rotate": "llama3.2:3b",
  "slot3_coding": "deepseek-coder-v2:16b",
  "slot4_coding": "qwen2.5-coder:7b",
  "slot6_coding_rotate": "codegemma:7b",
  "general_ranking": [...],
  "coding_ranking": [...],
  "all_metrics": { ... }
}

Tags

ansible-playbook playbooks/site.yml --tags models -K -e @local.yml
ansible-playbook playbooks/site.yml --tags models-warmup -K -e @local.yml