1 day ago · bf99e921b9
--- a/HANDOFF.md
+++ b/HANDOFF.md
@@ -1,60 +0,0 @@
 
-# Session Handoff — 2026-03-09
-
-## Branch
-
-`feature/three-pass-benchmark`
-
-## What Was Just Done
-
-Fixed two bugs that caused near-zero tok/sec and latency=9999 in the concurrent benchmark.
-
-### Bug 1 — Queue contamination (tok/sec ≈ 0.04–0.08)
-
-**Root cause:** Benchmark was firing all 21 requests (3 models × 7 prompts) at once via
-async uri. With `OLLAMA_NUM_PARALLEL=2`, only 2 slots run at a time; the other 19 queue.
-`eval_duration` includes queue-wait time → tok/sec measured as ~0.18 instead of ~22.
-
-**Fix:** `playbooks/_bench_tier_batch.yml` — replaced the 7 async fire+collect tasks
-with 5 synchronous `uri` tasks (no `async`/`poll`). Node1 runs all its prompts first,
-then node0. One request at a time = idle slot = clean eval_duration.
-
-Key structural changes:
-
-- No more `_bench_node1_jobs` / `_bench_node0_jobs` intermediate registers
-- No more `async_status` collect tasks
-- Accumulate tasks now use `item.item` (top-level in sync uri) instead of
-`item._async_job.item` (the old async indirection)
-- `timeout` reverted from `benchmark_large_timeout × 15` to plain `benchmark_large_timeout`
-
-### Bug 2 — latency_ms = 9999 (latency_ns = 0)
-
-**Root cause:** Load phase warm-up "Hi" populates KV cache. Benchmark "Hi" hits cache →
-`prompt_eval_duration ≈ 0` and `eval_duration ≈ 0` → sum = 0 → latency_ms = 9999.
-
-**Fix:** `playbooks/03_benchmark.yml` line 268 — changed latency measurement from
-`eval_duration + prompt_eval_duration` to `resp.total_duration | default(0) | int`.
-`total_duration` is true wall-clock end-to-end; never zero for a completed request.
-
-## State of the Branch
-
-Both fixes are committed as `d9a991f`. Nothing is pending.
-
-## Expected Results After Fix
-
-- `avg_tok_per_sec` ≈ 5–25 (was 0.04–0.08)
-- `latency_ms` ≈ 300–6000 ms (was 9999)
-- Composite scores ≈ 0.50–0.85 (was ≈ 0.45 flat)
-
-## Next Steps
-
-1. Run benchmark to verify the fix:
-
-1. If scores look correct, update warm-up slots:
-
-```bash
-ansible-playbook playbooks/03_benchmark.yml -K -e @local.yml && \
-ansible-playbook playbooks/04_models.yml -K -e @local.yml
-```
-
-1. Merge `feature/three-pass-benchmark` into `master` when satisfied.
-
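Note on the two metrics described in the removed handoff: the arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not the playbook's actual Jinja2 logic. It assumes the response carries the fields named above (`eval_duration`, `prompt_eval_duration`, `total_duration`) plus `eval_count`, all durations in nanoseconds, and that 9999 is the fallback used when the measured latency is zero. The function names and the sample responses are hypothetical.

```python
NS_PER_SEC = 1_000_000_000
NS_PER_MS = 1_000_000
LATENCY_SENTINEL_MS = 9999  # fallback the handoff reports when latency_ns == 0


def _ns(resp, key):
    """Read a nanosecond field, treating a missing or null value as 0."""
    return int(resp.get(key) or 0)


def tokens_per_second(resp):
    """Generated tokens divided by generation time (0.0 if no time was recorded)."""
    eval_ns = _ns(resp, "eval_duration")
    if eval_ns <= 0:
        return 0.0
    return _ns(resp, "eval_count") * NS_PER_SEC / eval_ns


def latency_ms_from_phases(resp):
    """Old measurement: sum of the prompt-eval and eval timers."""
    latency_ns = _ns(resp, "eval_duration") + _ns(resp, "prompt_eval_duration")
    return latency_ns / NS_PER_MS if latency_ns > 0 else LATENCY_SENTINEL_MS


def latency_ms_from_total(resp):
    """New measurement: end-to-end total_duration."""
    latency_ns = _ns(resp, "total_duration")
    return latency_ns / NS_PER_MS if latency_ns > 0 else LATENCY_SENTINEL_MS


# Hypothetical responses: one where both phase timers are zero (the Bug 2
# scenario), and one ordinary generation.
cached = {"total_duration": 350_000_000, "prompt_eval_duration": 0,
          "eval_duration": 0, "eval_count": 0}
normal = {"total_duration": 2_400_000_000, "prompt_eval_duration": 300_000_000,
          "eval_duration": 2_000_000_000, "eval_count": 44}

if __name__ == "__main__":
    print(latency_ms_from_phases(cached), latency_ms_from_total(cached))
    print(tokens_per_second(normal))
```

With the phase timers at zero, the old sum falls through to the sentinel while `total_duration` still yields a usable number, which matches the behaviour the handoff describes for Bug 2.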
														
--- a/playbooks/03_benchmark.yml
+++ b/playbooks/03_benchmark.yml
@@ -393,12 +393,12 @@
 
           ## Model Selection (6-slot / 2-socket)
           | Slot | Socket | Role | Model | Composite Score |
           |------|--------|------|-------|----------------|
-          | 1 | Node 1 (port 11434) | General (locked) | {{ selection.slot1_general }} | {{ parsed_metrics[selection.slot1_general].general_composite | default('N/A') }} |
-          | 2 | Node 1 (port 11434) | General (locked) | {{ selection.slot2_general }} | {{ parsed_metrics[selection.slot2_general].general_composite | default('N/A') }} |
-          | 5 | Node 1 (port 11434) | General (rotate) | {{ selection.slot5_general_rotate }} | {{ parsed_metrics[selection.slot5_general_rotate].general_composite | default('N/A') }} |
-          | 3 | Node 0 (port 11435) | Coding (locked) | {{ selection.slot3_coding }} | {{ parsed_metrics[selection.slot3_coding].coding_composite | default('N/A') }} |
-          | 4 | Node 0 (port 11435) | Coding (locked) | {{ selection.slot4_coding }} | {{ parsed_metrics[selection.slot4_coding].coding_composite | default('N/A') }} |
-          | 6 | Node 0 (port 11435) | Coding (rotate) | {{ selection.slot6_coding_rotate }} | {{ parsed_metrics[selection.slot6_coding_rotate].coding_composite | default('N/A') }} |
+          | 1 | Node 1 (port 11434) | General (locked) | {{ selection.slot1_general }} | {{ (parsed_metrics[selection.slot1_general].general_composite if selection.slot1_general in parsed_metrics else 'N/A') }} |
+          | 2 | Node 1 (port 11434) | General (locked) | {{ selection.slot2_general }} | {{ (parsed_metrics[selection.slot2_general].general_composite if selection.slot2_general in parsed_metrics else 'N/A') }} |
+          | 5 | Node 1 (port 11434) | General (rotate) | {{ selection.slot5_general_rotate }} | {{ (parsed_metrics[selection.slot5_general_rotate].general_composite if selection.slot5_general_rotate in parsed_metrics else 'N/A') }} |
+          | 3 | Node 0 (port 11435) | Coding (locked) | {{ selection.slot3_coding }} | {{ (parsed_metrics[selection.slot3_coding].coding_composite if selection.slot3_coding in parsed_metrics else 'N/A') }} |
+          | 4 | Node 0 (port 11435) | Coding (locked) | {{ selection.slot4_coding }} | {{ (parsed_metrics[selection.slot4_coding].coding_composite if selection.slot4_coding in parsed_metrics else 'N/A') }} |
+          | 6 | Node 0 (port 11435) | Coding (rotate) | {{ selection.slot6_coding_rotate }} | {{ (parsed_metrics[selection.slot6_coding_rotate].coding_composite if selection.slot6_coding_rotate in parsed_metrics else 'N/A') }} |
           ## Detailed Metrics
           {% for model, metrics in parsed_metrics.items() %}
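Note on the hunk above: the new expressions test membership before dereferencing, so a selected model that has no entry in `parsed_metrics` renders as `N/A`. The removed form applied `default('N/A')` only after `parsed_metrics[...]` had already been indexed and its attribute read, which presumably is what failed when a slot's model was missing. The guard can be sketched in plain Python as an analogy. This is not Jinja2 itself, and `composite_or_na` and the sample data are hypothetical.

```python
def composite_or_na(parsed_metrics, model, field):
    """Return parsed_metrics[model][field], or 'N/A' when the model has no entry.

    Mirrors the template's
    `parsed_metrics[m].<field> if m in parsed_metrics else 'N/A'`.
    """
    if model in parsed_metrics:
        return parsed_metrics[model][field]
    return "N/A"


# Hypothetical metrics for a single benchmarked model.
parsed_metrics = {
    "model-a": {"general_composite": 0.72, "coding_composite": 0.61},
}

if __name__ == "__main__":
    print(composite_or_na(parsed_metrics, "model-a", "general_composite"))
    print(composite_or_na(parsed_metrics, "model-b", "coding_composite"))
```

As in the template, the guard covers only a missing model. A model that is present but lacks the composite field would still fail here (a `KeyError` in this sketch), so that case may deserve its own fallback.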
														
--- a/templates/ollama/ollama-node0.service.j2
+++ b/templates/ollama/ollama-node0.service.j2
@@ -7,7 +7,7 @@ Wants=network-online.target
 
 ExecStart=/usr/bin/numactl --cpunodebind=0 {{ ollama_binary_path }} serve
 Environment="OLLAMA_API_KEY={{ ollama_api_key }}"
 Environment="OLLAMA_HOST=0.0.0.0:{{ ollama_node0_port }}"
-Environment="OLLAMA_MODELS={{ ollama_models_path }}"
+Environment="OLLAMA_MODELS=/mnt/ai_data/ollama_models"
 Environment="OLLAMA_KEEP_ALIVE={{ ollama_keep_alive }}"
 Environment="OLLAMA_FLASH_ATTENTION={{ ollama_flash_attention }}"
 Environment="OLLAMA_NUM_THREADS={{ ollama_num_threads }}"