After resolving the dual NUMA/CPUAffinity performance regression (2026-03-10), two
post-fix benchmark runs were executed to validate the effect of raising
benchmark_num_predict from 300 to 500. This document captures the four-run history,
before/after comparison, full Run 4 model results, and findings on system tuning state.

| Run | Timestamp | Condition | Result |
|---|---|---|---|
| 1 | 20260309T080551 | Broken NUMA (membind + CPUAffinity) | quality=0, tok/sec≈0.0–0.1 |
| 2 | 20260309T174604 | Broken NUMA (same bug) | quality=0, tok/sec=0.1 |
| 3 | 20260310T094843 | Post-NUMA-fix, num_predict=300, 4 models | quality=0.78–0.97, tok/sec=6.5–22.8 |
| 4 | 20260310T110632 | Post-NUMA-fix, num_predict=500, 9 models | quality=0.83–0.97, tok/sec=3.2–25.0 |

| Model | coding_quality @ 300 | coding_quality @ 500 | Delta |
|---|---|---|---|
| deepseek-coder-v2:latest | 0.783 | 0.833 | +0.050 |
| qwen2.5-coder:7b | 0.800 | 0.850 | +0.050 |
| llama3.2:3b | 0.850 | 0.890 | +0.040 |
| gemma3:12b-it-q4_K_M | 0.850 | 0.873 | +0.023 |

| Model | tok/sec | coding_q | general_q | latency_ms | coding_composite | general_composite | category |
|---|---|---|---|---|---|---|---|
| deepseek-coder-v2:16b | 24.5 | 0.833 | 0.885 | 1415.1 | 0.738 | 0.762 | coding |
| deepseek-coder-v2:latest | 25.0 | 0.833 | 0.885 | 1543.2 | 0.735 | 0.758 | coding |
| qwen2.5-coder:latest | 12.8 | 0.850 | 0.910 | 1228.2 | 0.667 | 0.694 | coding |
| qwen2.5-coder:7b | 12.7 | 0.850 | 0.910 | 1231.9 | 0.666 | 0.693 | coding |
| qwen2.5-coder:14B | 6.6 | 0.850 | 0.931 | 2195.9 | 0.572 | 0.609 | coding |
| codellama:34b | 3.2 | 0.833 | 0.586 | 4244.1 | 0.437 | 0.326 | coding |
| llama3.2:3b | 22.3 | 0.890 | 0.954 | 644.2 | 0.785 | 0.814 | general |
| llama3.1:8b | 11.8 | 0.823 | 0.877 | 2249.3 | 0.596 | 0.621 | general |
| gemma3:12b-it-q4_K_M | 6.4 | 0.873 | 0.966 | 6355.8 | 0.441 | 0.483 | general |

| Slot | Socket | Role | Model | Composite |
|---|---|---|---|---|
| 1 | Node 1 (port 11434) | General (locked) | llama3.2:3b | 0.814 |
| 2 | Node 1 (port 11434) | General (locked) | llama3.1:8b | 0.621 |
| 3 | Node 0 (port 11435) | Coding (locked) | deepseek-coder-v2:16b | 0.738 |
| 4 | Node 0 (port 11435) | Coding (locked) | deepseek-coder-v2:latest | 0.735 |
| 5 | Node 1 (port 11434) | General (rotate) | gemma3:12b-it-q4_K_M | 0.483 |
| 6 | Node 0 (port 11435) | Coding (rotate) | qwen2.5-coder:latest | 0.667 |

| Variable | Value | Status |
|---|---|---|
| benchmark_num_predict | 500 | Optimal — rubric ceiling is now the binding constraint |
| benchmark_large_timeout | 480s | Adequate — 6–20x margin at current 3–25 tok/sec speeds |
| benchmark_toks_norm_ceiling | 40 | Correct — fastest model at 62.5% of ceiling |
| benchmark_coding_threshold | 0.10 | Correct — name-pattern fallback handling remaining cases |
| Scoring weights | 0.45/0.30/0.25 | Appropriate for interactive serving platform |
Finding 1 — num_predict=500 confirmed correct. Every model improved on coding_quality (+0.023 to +0.050). No timeouts observed. The rubric ceiling is now the binding constraint; further increases (700+) would yield at most +0.02 additional improvement.
Finding 2 — Coding quality inversion narrowed (expected, not a bug). Coding specialists
score lower on coding than general quality because general prompts don't require assert,
test_def, or type_hint (the hardest scoring markers). The gap halved from ~−0.110 to
~−0.052 vs. Run 3, confirming truncation was part of the cause. Name-pattern fallback
continues to correctly classify these models.
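To make the marker mechanism in Finding 2 concrete, a toy marker check might look like the sketch below. The marker names (assert, test_def, type_hint) come from this document; the regexes, equal weighting, and function name are illustrative assumptions, not the rubric's actual implementation.

```python
import re

# Hypothetical marker patterns; the real rubric's detection logic is unknown.
MARKERS = {
    "assert": re.compile(r"\bassert\b"),
    "test_def": re.compile(r"\bdef test_\w+\s*\("),
    "type_hint": re.compile(r"def \w+\([^)]*:\s*\w+|->\s*\w+"),
}

def marker_score(completion: str) -> float:
    """Fraction of hard coding markers present in a generated completion."""
    hits = sum(bool(rx.search(completion)) for rx in MARKERS.values())
    return hits / len(MARKERS)

typed_with_tests = "def test_add() -> None:\n    assert add(1, 2) == 3\n"
prose_answer = "Paris is the capital of France."

print(marker_score(typed_with_tests))  # 1.0
print(marker_score(prose_answer))      # 0.0
```

A general-style answer scores 0 on all three markers, which is why general prompts are not penalized by them while coding prompts are graded against all of them.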
Finding 3 — deepseek-coder-v2:16b and :latest may be the same weights (ACTION REQUIRED).
Both share identical quality scores (0.833/0.885) and nearly identical throughput (24.5 vs.
25.0 tok/sec). In Ollama, :latest typically resolves to the same weights as the default
variant. If confirmed identical, slots 3 and 4 hold duplicate models — zero benefit, wasted
VRAM. See Testing Needed for verification steps.
Finding 4 — qwen2.5-coder:latest and :7b are near-identical (informational). Composites of 0.667 vs. 0.666. Lower impact since only one is active in slot 6 at a time.
Finding 5 — llama3.2:3b outperforms coding specialists on coding composite (informational). coding_composite=0.785 beats all dedicated coding models. Mathematically correct: speed (22.3 tok/sec) and latency (644 ms) dominate the weighted score. Correctly classified as general because general_composite (0.814) > coding_composite (0.785) and the delta is below the 0.10 threshold.
Finding 6 — codellama:34b correctly excluded. 3.2 tok/sec, general_quality=0.586 falls below min_quality_score=0.6. Scoring system worked as designed.
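The composite arithmetic behind Findings 5 and 6 can be sketched. The exact pipeline formula is an assumption, but the weighted form below (0.45 quality, 0.30 speed, 0.25 latency, with the 40 tok/sec ceiling and 5000 ms latency cutoff from the tuning table) reproduces the Run 4 table values to within rounding:

```python
def composite(quality: float, tok_per_sec: float, latency_ms: float,
              weights=(0.45, 0.30, 0.25),
              toks_ceiling: float = 40.0,
              latency_floor_ms: float = 5000.0) -> float:
    """Hedged reconstruction of the benchmark composite score.

    Speed is normalized against benchmark_toks_norm_ceiling (40 tok/sec);
    latency_score decays linearly and hits 0 at the 5000 ms TTFT cutoff,
    which is why gemma3:12b (TTFT > 5 s) scores 0 on latency every run.
    """
    w_q, w_s, w_l = weights
    speed_score = min(tok_per_sec / toks_ceiling, 1.0)
    latency_score = max(0.0, 1.0 - latency_ms / latency_floor_ms)
    return w_q * quality + w_s * speed_score + w_l * latency_score

# Run 4 spot checks (inputs from the table above):
print(round(composite(0.833, 24.5, 1415.1), 3))  # deepseek-coder-v2:16b → 0.738
print(round(composite(0.873, 6.4, 6355.8), 3))   # gemma3:12b → 0.441 (latency_score = 0)
```

Under this form, llama3.2:3b's coding composite exceeds every coding specialist's purely on the speed and latency terms, exactly as Finding 5 describes.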
Run on ai_server:
ollama show deepseek-coder-v2:16b --modelfile | grep FROM
ollama show deepseek-coder-v2:latest --modelfile | grep FROM
If digests match (same weights): update model_selection.json slot4_coding manually
(or remove one deepseek variant and re-run 03_benchmark.yml) to redirect slot 4 to
qwen2.5-coder:14B (composite=0.572) or another diverse candidate for model diversity.
If digests differ (different weights): no action — the pipeline is working as designed.
If slot4 is redirected, run:
ansible-playbook playbooks/04_models.yml -K -e @local.yml
Confirm both warmup services start cleanly:
systemctl status ollama-warmup.service ollama-warmup-node0.service

| Run | Timestamp | Condition | Models | Result |
|---|---|---|---|---|
| 1 | 20260309T080551 | Broken NUMA (membind + CPUAffinity) | — | quality=0, tok/sec≈0.0–0.1 |
| 2 | 20260309T174604 | Broken NUMA (same bug) | — | quality=0, tok/sec=0.1 |
| 3 | 20260310T094843 | Post-NUMA-fix, num_predict=300 | 4 | quality=0.78–0.97, tok/sec=6.5–22.8 |
| 4 | 20260310T110632 | num_predict=500, deepseek:latest present | 9 | quality=0.83–0.97, tok/sec=3.2–25.0 |
| 5 | 20260310T122818 | num_predict=500, deepseek:latest removed | 8 | quality=0.83–0.97, tok/sec=3.2–24.5 |

| Model | R4 tok/sec | R5 tok/sec | R4 coding_comp | R5 coding_comp | Delta |
|---|---|---|---|---|---|
| deepseek-coder-v2:16b | 24.5 | 24.1 | 0.738 | 0.727 | −0.011 (noise) |
| qwen2.5-coder:latest | 12.8 | 12.4 | 0.667 | 0.671 | +0.004 (noise) |
| qwen2.5-coder:7b | 12.7 | 12.6 | 0.666 | 0.674 | +0.008 (noise) |
| qwen2.5-coder:14B | 6.6 | 6.6 | 0.572 | 0.573 | +0.001 (noise) |
| llama3.2:3b | 22.3 | 24.5 | 0.785 | 0.806 | +0.021 (notable) |
| llama3.1:8b | 11.8 | 11.9 | 0.596 | 0.600 | +0.004 (noise) |
| gemma3:12b-it-q4_K_M | 6.4 | 6.2 | 0.441 | 0.439 | −0.002 (noise) |
| codellama:34b | 3.2 | 3.2 | 0.437 | 0.436 | −0.001 (noise) |
Quality scores (coding_quality, general_quality) are identical across both runs — confirming rubric scores are stable and deterministic at num_predict=500.

| Slot | Socket | Role | Model | Composite |
|---|---|---|---|---|
| 1 | Node 1 (port 11434) | General (locked) | llama3.2:3b | 0.835 |
| 2 | Node 1 (port 11434) | General (locked) | llama3.1:8b | 0.624 |
| 5 | Node 1 (port 11434) | General (rotate) | gemma3:12b-it-q4_K_M | 0.481 |
| 3 | Node 0 (port 11435) | Coding (locked) | deepseek-coder-v2:16b | 0.727 |
| 4 | Node 0 (port 11435) | Coding (locked) | qwen2.5-coder:7b | 0.674 |
| 6 | Node 0 (port 11435) | Coding (rotate) | qwen2.5-coder:latest | 0.671 |
Note: slot4 is qwen2.5-coder:7b — the pipeline correctly ranked it #2 coding (0.674),
superseding the manual qwen2.5-coder:14B edit made earlier this session.
Finding 1 — System is stable; tuning parameters remain optimal (no action). All quality scores are identical between Run 4 and Run 5. Speed and latency deltas are within normal run-to-run variance (±0.4 tok/sec, ±200ms TTFT). No tuning changes needed.

| Variable | Value | Status |
|---|---|---|
| benchmark_num_predict | 500 | Optimal — rubric ceiling is binding constraint |
| benchmark_large_timeout | 480s | Adequate — 6–20x margin at 3–25 tok/sec |
| benchmark_toks_norm_ceiling | 40 | Correct — fastest model at 61% of ceiling |
| benchmark_coding_threshold | 0.10 | Correct — name-pattern fallback working |
| Scoring weights | 0.45/0.30/0.25 | Appropriate for interactive serving |
Finding 2 — llama3.2:3b improved after deepseek:latest removal (informational). tok/sec: 22.3 → 24.5 (+2.2), general_composite: 0.814 → 0.835 (+0.021). Likely cause: removing one large model reduced memory pressure / NUMA contention during warmup. The 3b model benefits most as it runs fastest and competes most for memory bandwidth.
Finding 3 — qwen2.5-coder:7b and :latest confirmed duplicate weights (RESOLVED).
Run 5 slot4=:7b (0.674) and slot6=:latest (0.671) showed identical quality scores
(coding=0.850, general=0.910) and nearly identical throughput (~12.4–12.8 tok/sec) across
both runs — same pattern as the deepseek duplicate. Verified on ai_server:
qwen2.5-coder:7b → sha256-60e05f2100071479f596b964f89f510f057ce397ea22f2833a0cfe029bfc2463
qwen2.5-coder:latest → sha256-60e05f2100071479f596b964f89f510f057ce397ea22f2833a0cfe029bfc2463
Digests match. qwen2.5-coder:latest removed. Next step: re-run 03_benchmark.yml (Run 6)
to promote qwen2.5-coder:14B to slot6_rotate, achieving genuine speed/quality diversity
on Node 0.
Finding 4 — gemma3:12b latency_score=0 is persistent (informational, no action). TTFT consistently 6.1–6.4 seconds, above the 5000ms floor → latency_score=0 every run. Hardware-limited (large quant loading time on Node 1), not a tuning issue. The model correctly holds slot5_general_rotate on the strength of general_quality=0.966. The latency penalty is working as intended.
Finding 5 — codellama:34b remains correctly excluded (informational, no action). composite=0.436, general_quality=0.586 — below both min_composite_score=0.55 and min_quality_score=0.6 every run. Pipeline working as designed.
Run 6: re-benchmark after qwen2.5-coder:latest removal to promote qwen2.5-coder:14B
to slot6_rotate and achieve model diversity on Node 0.
ansible-playbook playbooks/03_benchmark.yml -K -e @local.yml && \
ansible-playbook playbooks/04_models.yml -K -e @local.yml

| Run | Timestamp | Condition | Models | Result |
|---|---|---|---|---|
| 1 | 20260309T080551 | Broken NUMA (membind + CPUAffinity) | — | quality=0, tok/sec≈0.0–0.1 |
| 2 | 20260309T174604 | Broken NUMA (same bug) | — | quality=0, tok/sec=0.1 |
| 3 | 20260310T094843 | Post-NUMA-fix, num_predict=300 | 4 | quality=0.78–0.97, tok/sec=6.5–22.8 |
| 4 | 20260310T110632 | num_predict=500, deepseek:latest present | 9 | quality=0.83–0.97, tok/sec=3.2–25.0 |
| 5 | 20260310T122818 | num_predict=500, deepseek:latest removed | 8 | quality=0.83–0.97, tok/sec=3.2–24.5 |
| 6 | 20260310T160815 | num_predict=500, qwen2.5-coder:latest removed | 8 | quality=0.83–0.97, tok/sec=3.2–24.2 |

| Model | tok/sec | coding_q | general_q | latency_ms | coding_comp | general_comp | category |
|---|---|---|---|---|---|---|---|
| deepseek-coder-v2:16b | 24.2 | 0.833 | 0.885 | 1383.8 | 0.737 | 0.760 | coding |
| deepseek-coder-v2:latest | 24.1 | 0.833 | 0.885 | 1411.4 | 0.735 | 0.759 | coding |
| qwen2.5-coder:7b | 12.6 | 0.850 | 0.910 | 1210.0 | 0.666 | 0.693 | coding |
| qwen2.5-coder:14B | 6.6 | 0.850 | 0.931 | 2181.0 | 0.573 | 0.609 | coding |
| codellama:34b | 3.2 | 0.833 | 0.586 | 4336.2 | 0.432 | 0.321 | coding |
| llama3.2:3b | 24.2 | 0.890 | 0.954 | 581.0 | 0.803 | 0.832 | general |
| llama3.1:8b | 11.8 | 0.823 | 0.877 | 2183.4 | 0.600 | 0.624 | general |
| gemma3:12b-it-q4_K_M | 6.2 | 0.873 | 0.966 | 5540.1 | 0.440 | 0.482 | general |

| Model | R5 tok/sec | R6 tok/sec | R5 coding_comp | R6 coding_comp | Delta |
|---|---|---|---|---|---|
| deepseek-coder-v2:16b | 24.1 | 24.2 | 0.727 | 0.737 | +0.010 (noise) |
| qwen2.5-coder:7b | 12.6 | 12.6 | 0.674 | 0.666 | −0.008 (noise) |
| qwen2.5-coder:14B | 6.6 | 6.6 | 0.573 | 0.573 | 0.000 |
| llama3.2:3b | 24.5 | 24.2 | 0.806 | 0.803 | −0.003 (noise) |
| llama3.1:8b | 11.9 | 11.8 | 0.600 | 0.600 | 0.000 |
| gemma3:12b-it-q4_K_M | 6.2 | 6.2 | 0.439 | 0.440 | +0.001 (noise) |
| codellama:34b | 3.2 | 3.2 | 0.436 | 0.432 | −0.004 (noise) |
Quality scores are identical across all common models. All composites within run-to-run noise (≤ ±0.010). Rubric confirmed deterministic across 6 runs.

| Slot | Socket | Role | Model | Composite |
|---|---|---|---|---|
| 1 | Node 1 (port 11434) | General (locked) | llama3.2:3b | 0.832 |
| 2 | Node 1 (port 11434) | General (locked) | llama3.1:8b | 0.624 |
| 5 | Node 1 (port 11434) | General (rotate) | gemma3:12b-it-q4_K_M | 0.482 |
| 3 | Node 0 (port 11435) | Coding (locked) | deepseek-coder-v2:16b | 0.737 |
| 4 | Node 0 (port 11435) | Coding (locked) | deepseek-coder-v2:latest | 0.735 ← REGRESSION |
| 6 | Node 0 (port 11435) | Coding (rotate) | qwen2.5-coder:7b | 0.666 |
Finding 1 — deepseek-coder-v2:latest re-appeared in slot4 (REGRESSION, now fixed).
Previously confirmed duplicate of :16b and removed after Run 4. It re-appeared in Run 6
because group_vars/all.yml contained two pull sources:

- baseline_models (line 121): "deepseek-coder-v2" — untagged; Ollama resolves it to :latest, re-pulling the duplicate on every benchmark run.
- candidate_models: an explicit "deepseek-coder-v2:latest" entry — unconditionally pulls :latest as a testable model.

Fix applied to inventory/group_vars/all.yml:

- baseline_models: changed "deepseek-coder-v2" → "deepseek-coder-v2:16b" (explicit tag)
- candidate_models: removed the deepseek-coder-v2:latest entry entirely

Also required on ai_server: ollama rm deepseek-coder-v2:latest
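The applied fix, sketched as an inventory/group_vars/all.yml excerpt. Only the deepseek entries are taken from this document; surrounding list members are illustrative placeholders, not the file's actual contents:

```yaml
# inventory/group_vars/all.yml (illustrative excerpt)
baseline_models:
  - "deepseek-coder-v2:16b"   # was "deepseek-coder-v2"; untagged resolves to :latest

candidate_models:
  # explicit "deepseek-coder-v2:latest" entry removed — duplicate of :16b
  - "qwen2.5-coder:14B"       # example remaining entry (illustrative)
```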
Finding 2 — All scores and tuning variables remain stable (no action). Every delta vs Run 5 is within noise (≤ ±0.010 composite, quality scores identical). The rubric is confirmed deterministic across 6 runs.

| Variable | Value | Status |
|---|---|---|
| benchmark_num_predict | 500 | Optimal |
| benchmark_large_timeout | 480s | Adequate |
| benchmark_toks_norm_ceiling | 40 | Correct |
| benchmark_coding_threshold | 0.10 | Correct |
Finding 3 — qwen2.5-coder:14B not yet in slot6 (consequence of Finding 1). With deepseek:latest occupying slot4, the coding rank yields: #1 deepseek:16b (0.737) → slot3, #2 deepseek:latest (0.735) → slot4, #3 qwen:7b (0.666) → slot6, #4 qwen:14B (0.573) → excluded. After deepseek:latest is permanently removed, Run 7 expected layout: slot3=deepseek:16b, slot4=qwen:7b, slot6=qwen:14B.
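The ranking consequence in Finding 3 can be sketched as below. Slot names follow this document's usage; the pipeline's actual assignment code is an assumption.

```python
def assign_coding_slots(ranked_models: list[str]) -> dict[str, str]:
    """Top two coding composites lock slots 3 and 4; the third rotates in slot 6.
    Models ranked below third place are excluded from slots entirely."""
    slots = ["slot3_coding", "slot4_coding", "slot6_rotate"]
    return dict(zip(slots, ranked_models))

# Run 6 coding order (deepseek:latest still present) vs expected Run 7 order
run6_rank = ["deepseek-coder-v2:16b", "deepseek-coder-v2:latest", "qwen2.5-coder:7b"]
run7_rank = ["deepseek-coder-v2:16b", "qwen2.5-coder:7b", "qwen2.5-coder:14B"]

print(assign_coding_slots(run7_rank)["slot6_rotate"])  # qwen2.5-coder:14B
```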
Finding 4 — gemma3:12b TTFT=5540ms (informational, no action). Persistently above 5000ms floor → latency_score=0 every run. Hardware-limited, not a tuning issue. Correctly holds slot5_general_rotate on general_quality=0.966.
Finding 5 — codellama:34b correctly excluded again (informational, no action). composite=0.432, general_quality=0.586 — below both thresholds. Pipeline working as designed.
ollama rm deepseek-coder-v2:latest

Run 7 (clean benchmark):
ansible-playbook playbooks/03_benchmark.yml -K -e @local.yml && \
ansible-playbook playbooks/04_models.yml -K -e @local.yml
Expected Run 7: slot4=qwen2.5-coder:7b, slot6=qwen2.5-coder:14B,
deepseek-coder-v2:latest absent from all_metrics.

| Run | Timestamp | Condition | Models | Result |
|---|---|---|---|---|
| 1 | 20260309T080551 | Broken NUMA (membind + CPUAffinity) | — | quality=0, tok/sec≈0.0–0.1 |
| 2 | 20260309T174604 | Broken NUMA (same bug) | — | quality=0, tok/sec=0.1 |
| 3 | 20260310T094843 | Post-NUMA-fix, num_predict=300 | 4 | quality=0.78–0.97, tok/sec=6.5–22.8 |
| 4 | 20260310T110632 | num_predict=500, deepseek:latest present | 9 | quality=0.83–0.97, tok/sec=3.2–25.0 |
| 5 | 20260310T122818 | num_predict=500, deepseek:latest removed | 8 | quality=0.83–0.97, tok/sec=3.2–24.5 |
| 6 | 20260310T160815 | num_predict=500, qwen2.5-coder:latest removed | 8 | quality=0.83–0.97, tok/sec=3.2–24.2 |
| 7 | 20260310T170013 | group_vars fix applied, deepseek:latest absent | 7 | quality=0.83–0.97, tok/sec=3.2–23.5 |

| Model | tok/sec | coding_q | general_q | latency_ms | coding_comp | general_comp | category |
|---|---|---|---|---|---|---|---|
| deepseek-coder-v2:16b | 23.5 | 0.833 | 0.885 | 1568.5 | 0.723 | 0.746 | coding |
| qwen2.5-coder:7b | 12.5 | 0.850 | 0.910 | 1431.0 | 0.655 | 0.682 | coding |
| qwen2.5-coder:14B | 6.6 | 0.850 | 0.931 | 2229.7 | 0.570 | 0.607 | coding |
| codellama:34b | 3.2 | 0.833 | 0.586 | 4235.4 | 0.437 | 0.326 | coding |
| llama3.2:3b | 23.0 | 0.890 | 0.954 | 754.8 | 0.786 | 0.814 | general |
| llama3.1:8b | 11.8 | 0.823 | 0.877 | 2202.0 | 0.599 | 0.623 | general |
| gemma3:12b-it-q4_K_M | 6.1 | 0.873 | 0.966 | 5941.9 | 0.439 | 0.481 | general |
deepseek-coder-v2:latest absent from all_metrics — group_vars fix verified working.

| Model | R6 tok/sec | R7 tok/sec | R6 coding_comp | R7 coding_comp | Delta |
|---|---|---|---|---|---|
| deepseek-coder-v2:16b | 24.2 | 23.5 | 0.737 | 0.723 | −0.014 (noise) |
| qwen2.5-coder:7b | 12.6 | 12.5 | 0.666 | 0.655 | −0.011 (noise) |
| qwen2.5-coder:14B | 6.6 | 6.6 | 0.573 | 0.570 | −0.003 (noise) |
| llama3.2:3b | 24.2 | 23.0 | 0.803 | 0.786 | −0.017 (noise) |
| llama3.1:8b | 11.8 | 11.8 | 0.600 | 0.599 | −0.001 (noise) |
| gemma3:12b-it-q4_K_M | 6.2 | 6.1 | 0.440 | 0.439 | −0.001 (noise) |
| codellama:34b | 3.2 | 3.2 | 0.432 | 0.437 | +0.005 (noise) |
Quality scores are identical across all common models. All composites within run-to-run noise (≤ ±0.017). Rubric confirmed deterministic across 7 runs.

| Slot | Socket | Role | Model | Composite |
|---|---|---|---|---|
| 1 | Node 1 (port 11434) | General (locked) | llama3.2:3b | 0.814 |
| 2 | Node 1 (port 11434) | General (locked) | llama3.1:8b | 0.623 |
| 5 | Node 1 (port 11434) | General (rotate) | gemma3:12b-it-q4_K_M | 0.481 |
| 3 | Node 0 (port 11435) | Coding (locked) | deepseek-coder-v2:16b | 0.723 |
| 4 | Node 0 (port 11435) | Coding (locked) | qwen2.5-coder:7b | 0.655 ✅ |
| 6 | Node 0 (port 11435) | Coding (rotate) | qwen2.5-coder:14B | 0.570 ✅ |
Finding 1 — Target Node 0 diversity layout achieved (RESOLVED). Run 7 confirms the intended three-tier Node 0 layout: slot 3 = deepseek-coder-v2:16b (0.723), slot 4 = qwen2.5-coder:7b (0.655), slot 6 = qwen2.5-coder:14B (0.570). All three are genuinely distinct models with different speed/quality tradeoffs.
Finding 2 — group_vars fix verified working (RESOLVED). deepseek-coder-v2:latest is
absent from all_metrics. Explicit :16b tag in baseline_models prevents Ollama from
resolving to :latest on subsequent runs. The fix is durable — re-running 03_benchmark.yml
will not re-introduce the duplicate.
Finding 3 — All scores and tuning variables stable (no action). Every delta vs Run 6 is within noise (≤ ±0.017 composite, quality scores identical). The pipeline is confirmed deterministic and stable.

| Variable | Value | Status |
|---|---|---|
| benchmark_num_predict | 500 | Optimal |
| benchmark_large_timeout | 480s | Adequate |
| benchmark_toks_norm_ceiling | 40 | Correct |
| benchmark_coding_threshold | 0.10 | Correct |
Finding 4 — Benchmark pipeline declared stable. Session closed. Seven runs over two
days confirmed: NUMA fix correct, scoring rubric deterministic, duplicate-model detection
pattern documented, group_vars idempotent. No further benchmark runs or tuning changes are
needed unless new models are added to candidate_models.