Dynamic benchmark system for all installed Ollama models. Runs a suite of coding and general-purpose tests against every model on the Ollama server, scores each model on a composite metric, and assigns models to the 6-slot dual-socket system based on results.
Modelfile aliases (coder-128k, coder-32k, coder-rotate, llama-family,
gemma-family) are automatically excluded from benchmarking — they share weights with
real models and their large context window parameters would stall every run with
285-second KV-cache allocations.
Benchmark all installed models:
ansible-playbook playbooks/03_benchmark.yml -K -e @local.yml
Benchmark specific models only:
ansible-playbook playbooks/03_benchmark.yml -K -e @local.yml \
-e "benchmark_models=qwen2.5-coder:14b,deepseek-coder-v2:16b"
Benchmark and immediately push 6-slot warm-up selections:
ansible-playbook playbooks/03_benchmark.yml -K -e @local.yml && \
ansible-playbook playbooks/04_models.yml -K -e @local.yml
Models are split into three size tiers before benchmarking. Each tier gets its own per-request timeout so that small models do not wait behind 70 B giants:
| Tier | RAM threshold | Timeout | Description |
|---|---|---|---|
| Small | < 10 GB | 300 s | 7 B and under — fast path |
| Medium | 10–15 GB | 900 s | 16 B lite / 12 B — standard wait |
| Large | > 15 GB | 1200 s | 34 B+ — 20-minute ceiling |
Size source vs runtime RAM: ollama list reports on-disk (compressed) sizes, which
are smaller than actual runtime RAM usage (model weights + KV cache + overhead). A
benchmark_size_overhead_factor (default 1.2) is applied when computing tier
boundaries: the disk-size cutoffs are divided by the factor before comparison. For
example, with default settings a 9 GB on-disk model is treated as ~10.8 GB at runtime
and falls in the medium tier rather than small.
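The tier rule above can be sketched in a few lines of Python. This is an illustration only, not the playbook's actual code: the function name `size_tier` and the `TIMEOUTS` mapping are hypothetical, and the handling of sizes that land exactly on a boundary is an assumption based on the table (small is `<`, large is `>`).

```python
# Hypothetical sketch of the tier rule; names are illustrative, not from the playbook.
TIMEOUTS = {"small": 300, "medium": 900, "large": 1200}  # seconds, from the tier table


def size_tier(disk_gb: float,
              small_max_gb: float = 10.0,
              medium_max_gb: float = 15.0,
              overhead_factor: float = 1.2) -> str:
    """Return 'small', 'medium', or 'large' for an on-disk model size in GB."""
    # Dividing the runtime-RAM cutoffs by the factor is equivalent to
    # multiplying the on-disk size by the factor before comparing.
    small_cutoff = small_max_gb / overhead_factor
    medium_cutoff = medium_max_gb / overhead_factor
    if disk_gb < small_cutoff:
        return "small"
    if disk_gb <= medium_cutoff:  # ASSUMPTION: the upper boundary belongs to medium
        return "medium"
    return "large"


if __name__ == "__main__":
    for size in (4.7, 9.0, 40.0):
        tier = size_tier(size)
        print(f"{size} GB on disk -> {tier} tier, timeout {TIMEOUTS[tier]} s")
```

With the defaults, the small/medium cutoff on disk is 10 / 1.2 ≈ 8.33 GB, which is why a 9 GB model is placed in the medium tier.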
Override tier boundaries:
# Adjust where the small/medium and medium/large boundaries sit
ansible-playbook playbooks/03_benchmark.yml -K -e @local.yml \
-e "benchmark_small_max_gb=8 benchmark_medium_max_gb=20"
# Tune the overhead factor if your models load larger/smaller than expected
ansible-playbook playbooks/03_benchmark.yml -K -e @local.yml \
-e "benchmark_size_overhead_factor=1.25"
# Override timeouts only
ansible-playbook playbooks/03_benchmark.yml -K -e @local.yml \
-e "benchmark_medium_timeout=600 benchmark_large_timeout=1800"
| Test | Prompt | What Is Measured |
|---|---|---|
| code_gen | Write a Python merge sort with type hints, docstring, and 3 unit tests | def, return, """, ->, assert, def test_, import |
| debug | Find and fix 3 bugs in a given Python function | def, return, code block, assert |
| refactor | Refactor a loop for readability and performance | def, return, code block, type hint, import |
| Test | Prompt | What Is Measured |
|---|---|---|
| explain | Explain how Python's GIL works and when it matters | Response length, paragraph structure, list formatting |
| creative | Suggest 5 fun family activities for a rainy weekend | Response length, paragraph structure, list formatting |
| reasoning | Apple arithmetic word problem | Response length, paragraph structure, list formatting |
| Test | Prompt | What Is Measured |
|---|---|---|
| latency | "Hi" | Total response time (eval + prompt eval), used as TTFT proxy |
For each category (coding, general), a composite score is calculated:
composite = (quality * 0.45) + (tokens_per_sec / ceiling, capped 1.0) * 0.30
+ (1 - ttft_ms / 5000, floored 0) * 0.25
Where:
- quality: 0.0–1.0 from heuristic checks per test type (see CLAUDE.md for weights)
- tokens_per_sec: averaged across all test responses; normalized against benchmark_toks_norm_ceiling (default 40)
- ttft_ms: latency test response time in milliseconds

A model is classified as coding if:
coding_composite - general_composite >= benchmark_coding_threshold # default 0.10
Name-pattern heuristics (coder, codestral, codellama, starcoder) apply as a
tiebreaker. Category can also be forced with model_category_overrides in inventory/group_vars/all.yml.
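The scoring and classification rules can be written out as a small Python sketch. The weights, the 5000 ms latency scale, and the defaults come from the text above; everything else is an assumption made for illustration. In particular, the README does not say exactly when the name-pattern tiebreaker fires, so the sketch assumes a "tie" means the two composites differ by less than the threshold, and it assumes `model_category_overrides` maps a model name to a category string.

```python
# Illustrative sketch only; function names and the tie rule are assumptions.
NAME_PATTERNS = ("coder", "codestral", "codellama", "starcoder")


def composite_score(quality, tokens_per_sec, ttft_ms, toks_norm_ceiling=40.0):
    """Weighted composite of quality, normalized speed, and latency."""
    speed = min(tokens_per_sec / toks_norm_ceiling, 1.0)   # capped at 1.0
    latency = max(1.0 - ttft_ms / 5000.0, 0.0)             # floored at 0
    return quality * 0.45 + speed * 0.30 + latency * 0.25


def classify(name, coding_composite, general_composite,
             coding_threshold=0.10, overrides=None):
    """Return 'coding' or 'general' for a model."""
    # ASSUMPTION: overrides is a dict like {"model:tag": "coding"}.
    if overrides and name in overrides:
        return overrides[name]
    delta = coding_composite - general_composite
    if delta >= coding_threshold:
        return "coding"
    # ASSUMPTION: the name pattern only breaks near-ties (|delta| < threshold).
    if abs(delta) < coding_threshold and any(p in name.lower() for p in NAME_PATTERNS):
        return "coding"
    return "general"


if __name__ == "__main__":
    # quality 0.8, 20 tok/sec, 2500 ms latency: 0.36 + 0.15 + 0.125
    print(round(composite_score(0.8, 20, 2500), 3))
    print(classify("qwen2.5-coder:7b", 0.70, 0.68))
```

Because speed is capped at the ceiling, a model running at 400 tok/sec scores the same on that term as one running at 40 tok/sec; raising `benchmark_toks_norm_ceiling` is what separates fast models from each other.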
All thresholds are configurable in inventory/group_vars/all.yml:
| Key | Default | Description |
|---|---|---|
| benchmark_thresholds.min_tokens_per_sec | 5.0 | Minimum tok/sec to be slot-eligible |
| benchmark_thresholds.min_quality_score | 0.6 | Minimum quality score to be slot-eligible |
| benchmark_thresholds.min_composite_score | 0.55 | Minimum composite to avoid threshold warning |
| benchmark_toks_norm_ceiling | 40 | tok/sec ceiling for normalization (dual-socket target) |
| benchmark_coding_threshold | 0.10 | coding-general composite delta for classification |
| benchmark_small_max_gb | 10 | Runtime RAM upper bound for small pass (GB) |
| benchmark_medium_max_gb | 15 | Runtime RAM upper bound for medium pass (GB) |
| benchmark_size_overhead_factor | 1.2 | Multiplier applied to ollama list disk sizes to estimate runtime RAM |
| benchmark_small_timeout | 300 | Per-request timeout for small models (seconds) |
| benchmark_medium_timeout | 900 | Per-request timeout for medium models (seconds) |
| benchmark_large_timeout | 1200 | Per-request timeout for large models (seconds) |
| benchmark_skip_aliases | see below | Modelfile aliases excluded from benchmark loop |
Default benchmark_skip_aliases:
- coder-128k
- coder-32k
- coder-rotate
- llama-family
- gemma-family
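As a rough illustration of how the alias list and the eligibility thresholds might be applied, here is a short Python sketch. The helper names are hypothetical, and two details are assumptions rather than documented behavior: aliases are matched on the name before any `:tag` suffix, and the threshold comparisons are inclusive.

```python
# Hypothetical helpers; matching and comparison details are assumptions.
SKIP_ALIASES = ["coder-128k", "coder-32k", "coder-rotate",
                "llama-family", "gemma-family"]


def models_to_benchmark(installed, skip_aliases=SKIP_ALIASES):
    """Drop Modelfile aliases from the list of installed model names."""
    skip = set(skip_aliases)
    # ASSUMPTION: an alias may be listed as "name" or "name:latest".
    return [m for m in installed if m.split(":", 1)[0] not in skip]


def is_slot_eligible(tokens_per_sec, quality,
                     min_tokens_per_sec=5.0, min_quality_score=0.6):
    """A model needs both minimum speed and minimum quality for a slot."""
    return tokens_per_sec >= min_tokens_per_sec and quality >= min_quality_score


def below_composite_warning(composite, min_composite_score=0.55):
    """True when the composite should trigger a threshold warning."""
    return composite < min_composite_score


if __name__ == "__main__":
    installed = ["coder-128k:latest", "llama3.1:8b", "gemma-family"]
    print(models_to_benchmark(installed))
    print(is_slot_eligible(12.0, 0.7), below_composite_warning(0.5))
```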
Each run produces benchmarks/results/benchmark_<timestamp>.md. The slot table now
covers all 6 slots across both NUMA instances:
| Slot | Socket | Role | Model | Composite |
|------|---------------------|-----------------|---------------------------|-----------|
| 1 | Node 1 (port 11434) | General (locked)| llama3.1:8b | 0.74 |
| 2 | Node 1 (port 11434) | General (locked)| mistral:latest | 0.71 |
| 5 | Node 1 (port 11434) | General (rotate)| llama3.2:3b | 0.63 |
| 3 | Node 0 (port 11435) | Coding (locked) | deepseek-coder-v2:16b | 0.82 |
| 4 | Node 0 (port 11435) | Coding (locked) | qwen2.5-coder:7b | 0.78 |
| 6 | Node 0 (port 11435) | Coding (rotate) | codegemma:7b | 0.69 |
Results are written to benchmarks/results/model_selection.json:
{
"slot1_general": "llama3.1:8b",
"slot2_general": "mistral:latest",
"slot5_general_rotate": "llama3.2:3b",
"slot3_coding": "deepseek-coder-v2:16b",
"slot4_coding": "qwen2.5-coder:7b",
"slot6_coding_rotate": "codegemma:7b",
"general_ranking": [...],
"coding_ranking": [...],
"all_metrics": { ... }
}
This file is read by 04_models.yml to decide what to pull and warm up. It is committed
to the repo so slot selections survive a clean checkout.
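For tooling outside Ansible, the selection file can be consumed with a few lines of Python. This is a minimal sketch, not what 04_models.yml does internally: the `models_to_warm` helper is hypothetical, and the inline sample only reproduces the six slot keys from the example above (the ranking and metrics fields are omitted).

```python
import json

# Slot keys as they appear in the example model_selection.json, in slot order.
SLOT_KEYS = [
    "slot1_general", "slot2_general", "slot3_coding",
    "slot4_coding", "slot5_general_rotate", "slot6_coding_rotate",
]


def models_to_warm(selection: dict) -> list:
    """Return the unique slot models in slot order, skipping empty slots."""
    models = []
    for key in SLOT_KEYS:
        model = selection.get(key)
        if model and model not in models:
            models.append(model)
    return models


# Inline sample standing in for benchmarks/results/model_selection.json.
sample = json.loads("""
{
  "slot1_general": "llama3.1:8b",
  "slot2_general": "mistral:latest",
  "slot5_general_rotate": "llama3.2:3b",
  "slot3_coding": "deepseek-coder-v2:16b",
  "slot4_coding": "qwen2.5-coder:7b",
  "slot6_coding_rotate": "codegemma:7b"
}
""")

if __name__ == "__main__":
    for model in models_to_warm(sample):
        print(model)
```

To read the real file instead of the sample, replace the `json.loads(...)` call with `json.load(open("benchmarks/results/model_selection.json"))`.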