
fix(benchmark): refine model selection and enhance evaluation metrics

- Update model selection logic to ensure accurate detection of eligible models.
- Adjust evaluation metrics to provide clearer insights into performance, including improved latency calculations.
- Implement changes to the benchmark pipeline for better handling of async job attributes and item integrity.
- Reset model_selection.json to prevent issues from prior runs.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Shaun Arman 1 day ago
parent
commit
d9450d0c08

+ 54 - 0
benchmarks/results/benchmark_20260309T080551.md

@@ -0,0 +1,54 @@
+# Benchmark Results - 20260309T080551
+
+## Model Selection (6-slot / 2-socket)
+| Slot | Socket | Role | Model | Composite Score |
+|------|--------|------|-------|----------------|
+| 1 | Node 1 (port 11434) | General (locked) | llama3.2:3b | 0.001 |
+| 2 | Node 1 (port 11434) | General (locked) | gemma3:12b-it-q4_K_M | 0.0 |
+| 5 | Node 1 (port 11434) | General (rotate) | none | N/A |
+| 3 | Node 0 (port 11435) | Coding (locked) | qwen2.5-coder:7b | 0.316 |
+| 4 | Node 0 (port 11435) | Coding (locked) | deepseek-coder-v2:latest | 0.0 |
+| 6 | Node 0 (port 11435) | Coding (rotate) | none | N/A |
+
+## Detailed Metrics
+### deepseek-coder-v2:latest
+- **Category**: coding
+- **Coding Quality**: 0
+- **General Quality**: 0
+- **Avg Tokens/sec**: 0.0
+- **Latency (ms)**: 676104.3
+- **Coding Composite**: 0.0
+- **General Composite**: 0.0
+### llama3.2:3b
+- **Category**: general
+- **Coding Quality**: 0
+- **General Quality**: 0
+- **Avg Tokens/sec**: 0.1
+- **Latency (ms)**: 154480.0
+- **Coding Composite**: 0.001
+- **General Composite**: 0.001
+### gemma3:12b-it-q4_K_M
+- **Category**: general
+- **Coding Quality**: 0
+- **General Quality**: 0
+- **Avg Tokens/sec**: 0.0
+- **Latency (ms)**: 722357.3
+- **Coding Composite**: 0.0
+- **General Composite**: 0.0
+### qwen2.5-coder:7b
+- **Category**: coding
+- **Coding Quality**: 0.7
+- **General Quality**: 0
+- **Avg Tokens/sec**: 0.1
+- **Latency (ms)**: 145493.5
+- **Coding Composite**: 0.316
+- **General Composite**: 0.001
+
+## Scoring Formula
+- Composite = quality * 0.45 + token_speed_normalized * 0.30 + latency_score * 0.25
+- Speed normalized against 40 tok/sec ceiling (hardware-observed max)
+- Coding quality (per-prompt):
+  code_gen: has_def×0.20 + has_return×0.20 + has_docstring×0.15 + has_type_hint×0.15 + has_code_block×0.10 + has_assert×0.08 + has_test_def×0.07 + has_import×0.05
+  debug:    has_def×0.30 + has_return×0.30 + has_code_block×0.25 + has_assert×0.15
+  refactor: has_def×0.25 + has_return×0.25 + has_code_block×0.20 + has_type_hint×0.15 + has_import×0.15
+- Category: override dict → quality delta (coding_avg - general_avg >= 0.1) → name pattern (coder/codestral/codellama/starcoder) → general
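
The composite formula in the report above can be sketched in Python as follows. This is a minimal illustration, not the benchmark's actual code: the function and constant names are hypothetical, `latency_score` is taken as an input because the report does not say how it is derived from the measured latency, and clamping the normalized speed to the range 0 to 1 is an assumption about what "ceiling" means.

```python
# Hypothetical sketch of the composite formula; all names are illustrative.
SPEED_CEILING_TOK_PER_SEC = 40.0  # ceiling stated in the report


def normalize_speed(tokens_per_sec: float) -> float:
    """Scale throughput against the 40 tok/sec ceiling.

    Clamping to [0, 1] is an assumption; the report only names the ceiling.
    """
    return max(0.0, min(tokens_per_sec / SPEED_CEILING_TOK_PER_SEC, 1.0))


def composite(quality: float, tokens_per_sec: float, latency_score: float) -> float:
    """Weighted sum from the report: 0.45 quality, 0.30 speed, 0.25 latency."""
    return (
        quality * 0.45
        + normalize_speed(tokens_per_sec) * 0.30
        + latency_score * 0.25
    )
```

For instance, `composite(0.7, 0.1, 0.0)` evaluates to roughly 0.316, which agrees with the Coding Composite reported for qwen2.5-coder:7b in this run if its latency of about 145 seconds is assumed to score zero.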

+ 47 - 0
benchmarks/results/benchmark_20260309T174604.md

@@ -0,0 +1,47 @@
+# Benchmark Results - 20260309T174604
+
+## Model Selection (6-slot / 2-socket)
+
+
+| Slot | Socket              | Role             | Model            | Composite Score |
+| ---- | ------------------- | ---------------- | ---------------- | --------------- |
+| 1    | Node 1 (port 11434) | General (locked) | llama3.2:3b      | 0.001           |
+| 2    | Node 1 (port 11434) | General (locked) | llama3.2:3b      | 0.001           |
+| 5    | Node 1 (port 11434) | General (rotate) | none             | N/A             |
+| 3    | Node 0 (port 11435) | Coding (locked)  | qwen2.5-coder:7b | 0.001           |
+| 4    | Node 0 (port 11435) | Coding (locked)  | qwen2.5-coder:7b | 0.001           |
+| 6    | Node 0 (port 11435) | Coding (rotate)  | none             | N/A             |
+
+
+## Detailed Metrics
+
+### llama3.2:3b
+
+- **Category**: general
+- **Coding Quality**: 0
+- **General Quality**: 0
+- **Avg Tokens/sec**: 0.1
+- **Latency (ms)**: 108021.2
+- **Coding Composite**: 0.001
+- **General Composite**: 0.001
+
+### qwen2.5-coder:7b
+
+- **Category**: coding
+- **Coding Quality**: 0
+- **General Quality**: 0
+- **Avg Tokens/sec**: 0.1
+- **Latency (ms)**: 146781.6
+- **Coding Composite**: 0.001
+- **General Composite**: 0.001
+
+## Scoring Formula
+
+- Composite = quality * 0.45 + token_speed_normalized * 0.30 + latency_score * 0.25
+- Speed normalized against 40 tok/sec ceiling (hardware-observed max)
+- Coding quality (per-prompt):
+code_gen: has_def×0.20 + has_return×0.20 + has_docstring×0.15 + has_type_hint×0.15 + has_code_block×0.10 + has_assert×0.08 + has_test_def×0.07 + has_import×0.05
+debug:    has_def×0.30 + has_return×0.30 + has_code_block×0.25 + has_assert×0.15
+refactor: has_def×0.25 + has_return×0.25 + has_code_block×0.20 + has_type_hint×0.15 + has_import×0.15
+- Category: override dict → quality delta (coding_avg - general_avg >= 0.1) → name pattern (coder/codestral/codellama/starcoder) → general
+
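
The per-prompt coding quality weights listed in the report can be expressed as a lookup table. The sketch below copies the weights verbatim; how each `has_*` feature is detected in a model response is not described in the report, so the function simply accepts the set of features that were found. The names `CODING_WEIGHTS` and `coding_quality` are hypothetical.

```python
# Weights copied from the report; feature detection itself is out of scope.
CODING_WEIGHTS = {
    "code_gen": {
        "has_def": 0.20, "has_return": 0.20, "has_docstring": 0.15,
        "has_type_hint": 0.15, "has_code_block": 0.10, "has_assert": 0.08,
        "has_test_def": 0.07, "has_import": 0.05,
    },
    "debug": {
        "has_def": 0.30, "has_return": 0.30,
        "has_code_block": 0.25, "has_assert": 0.15,
    },
    "refactor": {
        "has_def": 0.25, "has_return": 0.25, "has_code_block": 0.20,
        "has_type_hint": 0.15, "has_import": 0.15,
    },
}


def coding_quality(prompt_type: str, features: set[str]) -> float:
    """Sum the weights of the features present in one response."""
    weights = CODING_WEIGHTS[prompt_type]
    total = sum(w for name, w in weights.items() if name in features)
    return round(total, 3)  # rounding hides float noise such as 0.9999999
```

The weights for each prompt type sum to 1.0, so a response that exhibits every listed feature scores 1.0 and one that exhibits none scores 0.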

+ 67 - 0
benchmarks/results/benchmark_20260310T094843.md

@@ -0,0 +1,67 @@
+# Benchmark Results - 20260310T094843
+
+## Model Selection (6-slot / 2-socket)
+
+
+| Slot | Socket              | Role             | Model                    | Composite Score |
+| ---- | ------------------- | ---------------- | ------------------------ | --------------- |
+| 1    | Node 1 (port 11434) | General (locked) | llama3.2:3b              | 0.814           |
+| 2    | Node 1 (port 11434) | General (locked) | gemma3:12b-it-q4_K_M     | 0.484           |
+| 5    | Node 1 (port 11434) | General (rotate) | none                     | N/A             |
+| 3    | Node 0 (port 11435) | Coding (locked)  | deepseek-coder-v2:latest | 0.693           |
+| 4    | Node 0 (port 11435) | Coding (locked)  | qwen2.5-coder:7b         | 0.638           |
+| 6    | Node 0 (port 11435) | Coding (rotate)  | none                     | N/A             |
+
+
+## Detailed Metrics
+
+### deepseek-coder-v2:latest
+
+- **Category**: coding
+- **Coding Quality**: 0.783
+- **General Quality**: 0.885
+- **Avg Tokens/sec**: 22.8
+- **Latency (ms)**: 1612.6
+- **Coding Composite**: 0.693
+- **General Composite**: 0.739
+
+### llama3.2:3b
+
+- **Category**: general
+- **Coding Quality**: 0.85
+- **General Quality**: 0.954
+- **Avg Tokens/sec**: 22.4
+- **Latency (ms)**: 661.8
+- **Coding Composite**: 0.767
+- **General Composite**: 0.814
+
+### gemma3:12b-it-q4_K_M
+
+- **Category**: general
+- **Coding Quality**: 0.85
+- **General Quality**: 0.966
+- **Avg Tokens/sec**: 6.5
+- **Latency (ms)**: 5730.8
+- **Coding Composite**: 0.431
+- **General Composite**: 0.484
+
+### qwen2.5-coder:7b
+
+- **Category**: coding
+- **Coding Quality**: 0.8
+- **General Quality**: 0.91
+- **Avg Tokens/sec**: 12.8
+- **Latency (ms)**: 1359.5
+- **Coding Composite**: 0.638
+- **General Composite**: 0.687
+
+## Scoring Formula
+
+- Composite = quality * 0.45 + token_speed_normalized * 0.30 + latency_score * 0.25
+- Speed normalized against 40 tok/sec ceiling (hardware-observed max)
+- Coding quality (per-prompt):
+code_gen: has_def×0.20 + has_return×0.20 + has_docstring×0.15 + has_type_hint×0.15 + has_code_block×0.10 + has_assert×0.08 + has_test_def×0.07 + has_import×0.05
+debug:    has_def×0.30 + has_return×0.30 + has_code_block×0.25 + has_assert×0.15
+refactor: has_def×0.25 + has_return×0.25 + has_code_block×0.20 + has_type_hint×0.15 + has_import×0.15
+- Category: override dict → quality delta (coding_avg - general_avg >= 0.1) → name pattern (coder/codestral/codellama/starcoder) → general
+
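
The category rule at the end of the report reads as a fallback chain: an explicit override wins, then the quality delta, then the model name, and otherwise the model is treated as general. A minimal sketch under stated assumptions follows: the function name, the shape of the override dict, and the case-insensitive substring match on the name are all hypothetical, since the report gives only the order of the checks.

```python
# Hypothetical sketch of the category fallback chain described in the report.
CODING_NAME_PATTERNS = ("coder", "codestral", "codellama", "starcoder")
QUALITY_DELTA_THRESHOLD = 0.1


def categorize(model, coding_avg, general_avg, overrides=None):
    """Return "coding" or "general" for a model name and its quality averages."""
    # 1. An explicit override takes precedence over everything else.
    if overrides and model in overrides:
        return overrides[model]
    # 2. A clear quality advantage on coding prompts marks a coding model.
    if coding_avg - general_avg >= QUALITY_DELTA_THRESHOLD:
        return "coding"
    # 3. Fall back to the model name (substring match is an assumption).
    if any(pattern in model.lower() for pattern in CODING_NAME_PATTERNS):
        return "coding"
    # 4. Default category.
    return "general"
```

With the figures from this run, deepseek-coder-v2:latest (0.783 coding, 0.885 general) fails the delta check but would still be classified as coding through the name pattern, while llama3.2:3b falls through to general.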

+ 117 - 0
benchmarks/results/benchmark_20260310T102149.md

@@ -0,0 +1,117 @@
+# Benchmark Results - 20260310T102149
+
+## Model Selection (6-slot / 2-socket)
+
+
+| Slot | Socket              | Role             | Model                    | Composite Score |
+| ---- | ------------------- | ---------------- | ------------------------ | --------------- |
+| 1    | Node 1 (port 11434) | General (locked) | llama3.2:3b              | 0.819           |
+| 2    | Node 1 (port 11434) | General (locked) | llama3.1:8b              | 0.621           |
+| 5    | Node 1 (port 11434) | General (rotate) | gemma3:12b-it-q4_K_M     | 0.484           |
+| 3    | Node 0 (port 11435) | Coding (locked)  | deepseek-coder-v2:16b    | 0.707           |
+| 4    | Node 0 (port 11435) | Coding (locked)  | deepseek-coder-v2:latest | 0.681           |
+| 6    | Node 0 (port 11435) | Coding (rotate)  | qwen2.5-coder:latest     | 0.644           |
+
+
+## Detailed Metrics
+
+### codellama:34b
+
+- **Category**: coding
+- **Coding Quality**: 0.783
+- **General Quality**: 0.586
+- **Avg Tokens/sec**: 3.2
+- **Latency (ms)**: 4350.0
+- **Coding Composite**: 0.409
+- **General Composite**: 0.32
+
+### deepseek-coder-v2:16b
+
+- **Category**: coding
+- **Coding Quality**: 0.783
+- **General Quality**: 0.885
+- **Avg Tokens/sec**: 24.6
+- **Latency (ms)**: 1586.8
+- **Coding Composite**: 0.707
+- **General Composite**: 0.753
+
+### qwen2.5-coder:14B
+
+- **Category**: coding
+- **Coding Quality**: 0.8
+- **General Quality**: 0.931
+- **Avg Tokens/sec**: 6.6
+- **Latency (ms)**: 2223.7
+- **Coding Composite**: 0.549
+- **General Composite**: 0.608
+
+### deepseek-coder-v2:latest
+
+- **Category**: coding
+- **Coding Quality**: 0.783
+- **General Quality**: 0.885
+- **Avg Tokens/sec**: 22.2
+- **Latency (ms)**: 1759.1
+- **Coding Composite**: 0.681
+- **General Composite**: 0.727
+
+### qwen2.5-coder:latest
+
+- **Category**: coding
+- **Coding Quality**: 0.8
+- **General Quality**: 0.91
+- **Avg Tokens/sec**: 12.8
+- **Latency (ms)**: 1239.2
+- **Coding Composite**: 0.644
+- **General Composite**: 0.694
+
+### llama3.1:8b
+
+- **Category**: general
+- **Coding Quality**: 0.8
+- **General Quality**: 0.877
+- **Avg Tokens/sec**: 11.8
+- **Latency (ms)**: 2251.2
+- **Coding Composite**: 0.586
+- **General Composite**: 0.621
+
+### qwen2.5-coder:7b
+
+- **Category**: coding
+- **Coding Quality**: 0.8
+- **General Quality**: 0.91
+- **Avg Tokens/sec**: 12.3
+- **Latency (ms)**: 1258.3
+- **Coding Composite**: 0.639
+- **General Composite**: 0.689
+
+### gemma3:12b-it-q4_K_M
+
+- **Category**: general
+- **Coding Quality**: 0.85
+- **General Quality**: 0.966
+- **Avg Tokens/sec**: 6.6
+- **Latency (ms)**: 5701.3
+- **Coding Composite**: 0.432
+- **General Composite**: 0.484
+
+### llama3.2:3b
+
+- **Category**: general
+- **Coding Quality**: 0.85
+- **General Quality**: 0.954
+- **Avg Tokens/sec**: 22.7
+- **Latency (ms)**: 613.5
+- **Coding Composite**: 0.772
+- **General Composite**: 0.819
+
+## Scoring Formula
+
+- Composite = quality * 0.45 + token_speed_normalized * 0.30 + latency_score * 0.25
+- Speed normalized against 40 tok/sec ceiling (hardware-observed max)
+- Coding quality (per-prompt):
+code_gen: has_def×0.20 + has_return×0.20 + has_docstring×0.15 + has_type_hint×0.15 + has_code_block×0.10 + has_assert×0.08 + has_test_def×0.07 + has_import×0.05
+debug:    has_def×0.30 + has_return×0.30 + has_code_block×0.25 + has_assert×0.15
+refactor: has_def×0.25 + has_return×0.25 + has_code_block×0.20 + has_type_hint×0.15 + has_import×0.15
+- Category: override dict → quality delta (coding_avg - general_avg >= 0.1) → name pattern (coder/codestral/codellama/starcoder) → general
+
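
The selection table in this report is consistent with ranking models inside each category by that category's composite score, giving the top two the locked slots and the third the rotate slot. The sketch below illustrates that reading only; the ranking rule is inferred from the tables rather than stated in the source, and `SLOTS`, `assign_slots`, and the input dict shape are hypothetical. It also keeps one entry per model name, so a single model cannot occupy two slots.

```python
# Hypothetical sketch: fill the six slots by ranking models within a category.
SLOTS = {
    "general": [1, 2, 5],  # two locked slots, then the rotate slot
    "coding": [3, 4, 6],
}


def assign_slots(models):
    """models: list of dicts with 'name', 'category', and 'composite' keys.

    'composite' is the score for the model's own category.
    Returns {slot_number: model_name or None}.
    """
    assignment = {}
    for category, slot_ids in SLOTS.items():
        best = {}
        for m in models:
            if m["category"] != category:
                continue
            # Keep the highest score if a name appears more than once.
            if m["name"] not in best or m["composite"] > best[m["name"]]:
                best[m["name"]] = m["composite"]
        ranked = sorted(best, key=best.get, reverse=True)
        for i, slot in enumerate(slot_ids):
            assignment[slot] = ranked[i] if i < len(ranked) else None
    return assignment
```

Fed the nine models and category composites from this run, the sketch returns the same slot assignment as the table above; a slot with no candidate comes back as `None`, corresponding to the `none` rows in earlier reports.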

+ 94 - 0
benchmarks/results/benchmark_20260310T110632.md

@@ -0,0 +1,94 @@
+# Benchmark Results - 20260310T110632
+
+## Model Selection (6-slot / 2-socket)
+| Slot | Socket | Role | Model | Composite Score |
+|------|--------|------|-------|----------------|
+| 1 | Node 1 (port 11434) | General (locked) | llama3.2:3b | 0.814 |
+| 2 | Node 1 (port 11434) | General (locked) | llama3.1:8b | 0.621 |
+| 5 | Node 1 (port 11434) | General (rotate) | gemma3:12b-it-q4_K_M | 0.483 |
+| 3 | Node 0 (port 11435) | Coding (locked) | deepseek-coder-v2:16b | 0.738 |
+| 4 | Node 0 (port 11435) | Coding (locked) | deepseek-coder-v2:latest | 0.735 |
+| 6 | Node 0 (port 11435) | Coding (rotate) | qwen2.5-coder:latest | 0.667 |
+
+## Detailed Metrics
+### codellama:34b
+- **Category**: coding
+- **Coding Quality**: 0.833
+- **General Quality**: 0.586
+- **Avg Tokens/sec**: 3.2
+- **Latency (ms)**: 4244.1
+- **Coding Composite**: 0.437
+- **General Composite**: 0.326
+### deepseek-coder-v2:latest
+- **Category**: coding
+- **Coding Quality**: 0.833
+- **General Quality**: 0.885
+- **Avg Tokens/sec**: 25.0
+- **Latency (ms)**: 1543.2
+- **Coding Composite**: 0.735
+- **General Composite**: 0.758
+### deepseek-coder-v2:16b
+- **Category**: coding
+- **Coding Quality**: 0.833
+- **General Quality**: 0.885
+- **Avg Tokens/sec**: 24.5
+- **Latency (ms)**: 1415.1
+- **Coding Composite**: 0.738
+- **General Composite**: 0.762
+### qwen2.5-coder:14B
+- **Category**: coding
+- **Coding Quality**: 0.85
+- **General Quality**: 0.931
+- **Avg Tokens/sec**: 6.6
+- **Latency (ms)**: 2195.9
+- **Coding Composite**: 0.572
+- **General Composite**: 0.609
+### qwen2.5-coder:latest
+- **Category**: coding
+- **Coding Quality**: 0.85
+- **General Quality**: 0.91
+- **Avg Tokens/sec**: 12.8
+- **Latency (ms)**: 1228.2
+- **Coding Composite**: 0.667
+- **General Composite**: 0.694
+### llama3.1:8b
+- **Category**: general
+- **Coding Quality**: 0.823
+- **General Quality**: 0.877
+- **Avg Tokens/sec**: 11.8
+- **Latency (ms)**: 2249.3
+- **Coding Composite**: 0.596
+- **General Composite**: 0.621
+### qwen2.5-coder:7b
+- **Category**: coding
+- **Coding Quality**: 0.85
+- **General Quality**: 0.91
+- **Avg Tokens/sec**: 12.7
+- **Latency (ms)**: 1231.9
+- **Coding Composite**: 0.666
+- **General Composite**: 0.693
+### gemma3:12b-it-q4_K_M
+- **Category**: general
+- **Coding Quality**: 0.873
+- **General Quality**: 0.966
+- **Avg Tokens/sec**: 6.4
+- **Latency (ms)**: 6355.8
+- **Coding Composite**: 0.441
+- **General Composite**: 0.483
+### llama3.2:3b
+- **Category**: general
+- **Coding Quality**: 0.89
+- **General Quality**: 0.954
+- **Avg Tokens/sec**: 22.3
+- **Latency (ms)**: 644.2
+- **Coding Composite**: 0.785
+- **General Composite**: 0.814
+
+## Scoring Formula
+- Composite = quality * 0.45 + token_speed_normalized * 0.30 + latency_score * 0.25
+- Speed normalized against 40 tok/sec ceiling (hardware-observed max)
+- Coding quality (per-prompt):
+  code_gen: has_def×0.20 + has_return×0.20 + has_docstring×0.15 + has_type_hint×0.15 + has_code_block×0.10 + has_assert×0.08 + has_test_def×0.07 + has_import×0.05
+  debug:    has_def×0.30 + has_return×0.30 + has_code_block×0.25 + has_assert×0.15
+  refactor: has_def×0.25 + has_return×0.25 + has_code_block×0.20 + has_type_hint×0.15 + has_import×0.15
+- Category: override dict → quality delta (coding_avg - general_avg >= 0.1) → name pattern (coder/codestral/codellama/starcoder) → general
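
The report does not define `latency_score`. One way to probe it is to back-solve the value implied by each reported composite and compare it with a candidate curve. The sketch below tries a linear falloff that reaches zero at 5000 ms; that cutoff is a guess inferred from the published figures, not something the source states, and every name in the block is hypothetical. Because the report rounds its inputs, only approximate agreement can be expected.

```python
# Inference check, not the benchmark's code: back-solve latency_score from
# reported composites and compare it with a hypothetical linear falloff.
SPEED_CEILING = 40.0        # tok/sec ceiling stated in the report
ASSUMED_CUTOFF_MS = 5000.0  # assumption inferred from the numbers


def implied_latency_score(composite, quality, tokens_per_sec):
    """Solve the composite formula for latency_score."""
    speed = min(tokens_per_sec / SPEED_CEILING, 1.0)
    return (composite - quality * 0.45 - speed * 0.30) / 0.25


def linear_latency_score(latency_ms, cutoff_ms=ASSUMED_CUTOFF_MS):
    """Candidate curve: 1.0 at 0 ms, falling linearly to 0.0 at the cutoff."""
    return max(0.0, 1.0 - latency_ms / cutoff_ms)


# (model, quality, tok/sec, latency ms, composite) for each model's own
# category, copied from run 20260310T110632.
ROWS = [
    ("codellama:34b", 0.833, 3.2, 4244.1, 0.437),
    ("deepseek-coder-v2:latest", 0.833, 25.0, 1543.2, 0.735),
    ("llama3.2:3b", 0.954, 22.3, 644.2, 0.814),
    ("gemma3:12b-it-q4_K_M", 0.966, 6.4, 6355.8, 0.483),
]


def compare(rows=ROWS):
    """Return (model, implied score, candidate score) for each row."""
    return [
        (name, implied_latency_score(comp, q, tps), linear_latency_score(lat))
        for name, q, tps, lat, comp in rows
    ]
```

Calling `compare()` yields implied and candidate scores that stay within 0.01 of each other for all four rows, so the linear reading is at least compatible with this run. It remains an inference until checked against the benchmark source.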

+ 107 - 0
benchmarks/results/benchmark_20260310T122818.md

@@ -0,0 +1,107 @@
+# Benchmark Results - 20260310T122818
+
+## Model Selection (6-slot / 2-socket)
+
+
+| Slot | Socket              | Role             | Model                 | Composite Score |
+| ---- | ------------------- | ---------------- | --------------------- | --------------- |
+| 1    | Node 1 (port 11434) | General (locked) | llama3.2:3b           | 0.835           |
+| 2    | Node 1 (port 11434) | General (locked) | llama3.1:8b           | 0.624           |
+| 5    | Node 1 (port 11434) | General (rotate) | gemma3:12b-it-q4_K_M  | 0.481           |
+| 3    | Node 0 (port 11435) | Coding (locked)  | deepseek-coder-v2:16b | 0.727           |
+| 4    | Node 0 (port 11435) | Coding (locked)  | qwen2.5-coder:7b      | 0.674           |
+| 6    | Node 0 (port 11435) | Coding (rotate)  | qwen2.5-coder:latest  | 0.671           |
+
+
+## Detailed Metrics
+
+### codellama:34b
+
+- **Category**: coding
+- **Coding Quality**: 0.833
+- **General Quality**: 0.586
+- **Avg Tokens/sec**: 3.2
+- **Latency (ms)**: 4261.3
+- **Coding Composite**: 0.436
+- **General Composite**: 0.325
+
+### deepseek-coder-v2:16b
+
+- **Category**: coding
+- **Coding Quality**: 0.833
+- **General Quality**: 0.885
+- **Avg Tokens/sec**: 24.1
+- **Latency (ms)**: 1583.1
+- **Coding Composite**: 0.727
+- **General Composite**: 0.75
+
+### qwen2.5-coder:14B
+
+- **Category**: coding
+- **Coding Quality**: 0.85
+- **General Quality**: 0.931
+- **Avg Tokens/sec**: 6.6
+- **Latency (ms)**: 2172.1
+- **Coding Composite**: 0.573
+- **General Composite**: 0.61
+
+### qwen2.5-coder:latest
+
+- **Category**: coding
+- **Coding Quality**: 0.85
+- **General Quality**: 0.91
+- **Avg Tokens/sec**: 12.4
+- **Latency (ms)**: 1102.0
+- **Coding Composite**: 0.671
+- **General Composite**: 0.698
+
+### llama3.1:8b
+
+- **Category**: general
+- **Coding Quality**: 0.823
+- **General Quality**: 0.877
+- **Avg Tokens/sec**: 11.9
+- **Latency (ms)**: 2186.7
+- **Coding Composite**: 0.6
+- **General Composite**: 0.624
+
+### qwen2.5-coder:7b
+
+- **Category**: coding
+- **Coding Quality**: 0.85
+- **General Quality**: 0.91
+- **Avg Tokens/sec**: 12.6
+- **Latency (ms)**: 1073.7
+- **Coding Composite**: 0.674
+- **General Composite**: 0.701
+
+### gemma3:12b-it-q4_K_M
+
+- **Category**: general
+- **Coding Quality**: 0.873
+- **General Quality**: 0.966
+- **Avg Tokens/sec**: 6.2
+- **Latency (ms)**: 6142.8
+- **Coding Composite**: 0.439
+- **General Composite**: 0.481
+
+### llama3.2:3b
+
+- **Category**: general
+- **Coding Quality**: 0.89
+- **General Quality**: 0.954
+- **Avg Tokens/sec**: 24.5
+- **Latency (ms)**: 568.5
+- **Coding Composite**: 0.806
+- **General Composite**: 0.835
+
+## Scoring Formula
+
+- Composite = quality * 0.45 + token_speed_normalized * 0.30 + latency_score * 0.25
+- Speed normalized against 40 tok/sec ceiling (hardware-observed max)
+- Coding quality (per-prompt):
+code_gen: has_def×0.20 + has_return×0.20 + has_docstring×0.15 + has_type_hint×0.15 + has_code_block×0.10 + has_assert×0.08 + has_test_def×0.07 + has_import×0.05
+debug:    has_def×0.30 + has_return×0.30 + has_code_block×0.25 + has_assert×0.15
+refactor: has_def×0.25 + has_return×0.25 + has_code_block×0.20 + has_type_hint×0.15 + has_import×0.15
+- Category: override dict → quality delta (coding_avg - general_avg >= 0.1) → name pattern (coder/codestral/codellama/starcoder) → general
+

+ 107 - 0
benchmarks/results/benchmark_20260310T160815.md

@@ -0,0 +1,107 @@
+# Benchmark Results - 20260310T160815
+
+## Model Selection (6-slot / 2-socket)
+
+
+| Slot | Socket              | Role             | Model                    | Composite Score |
+| ---- | ------------------- | ---------------- | ------------------------ | --------------- |
+| 1    | Node 1 (port 11434) | General (locked) | llama3.2:3b              | 0.832           |
+| 2    | Node 1 (port 11434) | General (locked) | llama3.1:8b              | 0.624           |
+| 5    | Node 1 (port 11434) | General (rotate) | gemma3:12b-it-q4_K_M     | 0.482           |
+| 3    | Node 0 (port 11435) | Coding (locked)  | deepseek-coder-v2:16b    | 0.737           |
+| 4    | Node 0 (port 11435) | Coding (locked)  | deepseek-coder-v2:latest | 0.735           |
+| 6    | Node 0 (port 11435) | Coding (rotate)  | qwen2.5-coder:7b         | 0.666           |
+
+
+## Detailed Metrics
+
+### codellama:34b
+
+- **Category**: coding
+- **Coding Quality**: 0.833
+- **General Quality**: 0.586
+- **Avg Tokens/sec**: 3.2
+- **Latency (ms)**: 4336.2
+- **Coding Composite**: 0.432
+- **General Composite**: 0.321
+
+### deepseek-coder-v2:latest
+
+- **Category**: coding
+- **Coding Quality**: 0.833
+- **General Quality**: 0.885
+- **Avg Tokens/sec**: 24.1
+- **Latency (ms)**: 1411.4
+- **Coding Composite**: 0.735
+- **General Composite**: 0.759
+
+### deepseek-coder-v2:16b
+
+- **Category**: coding
+- **Coding Quality**: 0.833
+- **General Quality**: 0.885
+- **Avg Tokens/sec**: 24.2
+- **Latency (ms)**: 1383.8
+- **Coding Composite**: 0.737
+- **General Composite**: 0.76
+
+### qwen2.5-coder:14B
+
+- **Category**: coding
+- **Coding Quality**: 0.85
+- **General Quality**: 0.931
+- **Avg Tokens/sec**: 6.6
+- **Latency (ms)**: 2181.0
+- **Coding Composite**: 0.573
+- **General Composite**: 0.609
+
+### llama3.1:8b
+
+- **Category**: general
+- **Coding Quality**: 0.823
+- **General Quality**: 0.877
+- **Avg Tokens/sec**: 11.8
+- **Latency (ms)**: 2183.4
+- **Coding Composite**: 0.6
+- **General Composite**: 0.624
+
+### qwen2.5-coder:7b
+
+- **Category**: coding
+- **Coding Quality**: 0.85
+- **General Quality**: 0.91
+- **Avg Tokens/sec**: 12.6
+- **Latency (ms)**: 1210.0
+- **Coding Composite**: 0.666
+- **General Composite**: 0.693
+
+### gemma3:12b-it-q4_K_M
+
+- **Category**: general
+- **Coding Quality**: 0.873
+- **General Quality**: 0.966
+- **Avg Tokens/sec**: 6.2
+- **Latency (ms)**: 5540.1
+- **Coding Composite**: 0.44
+- **General Composite**: 0.482
+
+### llama3.2:3b
+
+- **Category**: general
+- **Coding Quality**: 0.89
+- **General Quality**: 0.954
+- **Avg Tokens/sec**: 24.2
+- **Latency (ms)**: 581.0
+- **Coding Composite**: 0.803
+- **General Composite**: 0.832
+
+## Scoring Formula
+
+- Composite = quality * 0.45 + token_speed_normalized * 0.30 + latency_score * 0.25
+- Speed normalized against 40 tok/sec ceiling (hardware-observed max)
+- Coding quality (per-prompt):
+code_gen: has_def×0.20 + has_return×0.20 + has_docstring×0.15 + has_type_hint×0.15 + has_code_block×0.10 + has_assert×0.08 + has_test_def×0.07 + has_import×0.05
+debug:    has_def×0.30 + has_return×0.30 + has_code_block×0.25 + has_assert×0.15
+refactor: has_def×0.25 + has_return×0.25 + has_code_block×0.20 + has_type_hint×0.15 + has_import×0.15
+- Category: override dict → quality delta (coding_avg - general_avg >= 0.1) → name pattern (coder/codestral/codellama/starcoder) → general
+