
Enhance benchmark pipeline with improved async handling and error detection

Refactor the benchmark pipeline to better manage async job attributes, ensuring that the item attribute is correctly accessed from the _async_job. This change improves eligible model detection by chaining the mapping of _async_job to item. Additionally, replace the single accumulate task with per-node loops to maintain the integrity of the item data before the compute task.

Update the outer batch loop in 03_benchmark.yml to include loop_var: _batch, addressing Ansible loop-variable collision warnings. Reset model_selection.json to prevent issues from previous runs.
Shaun Arman 4 days ago
Parent
Commit
08203de7f5

+ 70 - 0
benchmarks/results/benchmark_20260308T145246.md

@@ -0,0 +1,70 @@
+# Benchmark Results - 20260308T145246
+
+## Model Selection (6-slot / 2-socket)
+| Slot | Socket | Role | Model | Composite Score |
+|------|--------|------|-------|----------------|
+| 1 | Node 1 (port 11434) | General (locked) | llama3.2:3b | 0.001 |
+| 2 | Node 1 (port 11434) | General (locked) | mistral-nemo:latest | 0.0 |
+| 5 | Node 1 (port 11434) | General (rotate) | mistral:latest | 0.0 |
+| 3 | Node 0 (port 11435) | Coding (locked) | qwen2.5-coder:7b | 0.0 |
+| 4 | Node 0 (port 11435) | Coding (locked) | qwen2.5-coder:7b | 0.0 |
+| 6 | Node 0 (port 11435) | Coding (rotate) | none | N/A |
+
+## Detailed Metrics
+### mistral-nemo:latest
+- **Category**: general
+- **Coding Quality**: 0
+- **General Quality**: 0
+- **Avg Tokens/sec**: 0.0
+- **Latency (ms)**: 9999
+- **Coding Composite**: 0.0
+- **General Composite**: 0.0
+### mistral:latest
+- **Category**: general
+- **Coding Quality**: 0
+- **General Quality**: 0
+- **Avg Tokens/sec**: 0.0
+- **Latency (ms)**: 9999
+- **Coding Composite**: 0.0
+- **General Composite**: 0.0
+### llama3.1:8b
+- **Category**: general
+- **Coding Quality**: 0
+- **General Quality**: 0
+- **Avg Tokens/sec**: 0.0
+- **Latency (ms)**: 9999
+- **Coding Composite**: 0.0
+- **General Composite**: 0.0
+### qwen2.5-coder:7b
+- **Category**: coding
+- **Coding Quality**: 0
+- **General Quality**: 0
+- **Avg Tokens/sec**: 0.0
+- **Latency (ms)**: 9999
+- **Coding Composite**: 0.0
+- **General Composite**: 0.0
+### gemma3:12b-it-q4_K_M
+- **Category**: general
+- **Coding Quality**: 0
+- **General Quality**: 0
+- **Avg Tokens/sec**: 0.0
+- **Latency (ms)**: 9999
+- **Coding Composite**: 0.0
+- **General Composite**: 0.0
+### llama3.2:3b
+- **Category**: general
+- **Coding Quality**: 0
+- **General Quality**: 0
+- **Avg Tokens/sec**: 0.1
+- **Latency (ms)**: 109301.3
+- **Coding Composite**: 0.001
+- **General Composite**: 0.001
+
+## Scoring Formula
+- Composite = quality * 0.45 + token_speed_normalized * 0.30 + latency_score * 0.25
+- Speed normalized against 40 tok/sec ceiling (hardware-observed max)
+- Coding quality (per-prompt):
+  code_gen: has_def×0.20 + has_return×0.20 + has_docstring×0.15 + has_type_hint×0.15 + has_code_block×0.10 + has_assert×0.08 + has_test_def×0.07 + has_import×0.05
+  debug:    has_def×0.30 + has_return×0.30 + has_code_block×0.25 + has_assert×0.15
+  refactor: has_def×0.25 + has_return×0.25 + has_code_block×0.20 + has_type_hint×0.15 + has_import×0.15
+- Category: override dict → quality delta (coding_avg - general_avg >= 0.1) → name pattern (coder/codestral/codellama/starcoder) → general
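
The composite formula recorded in the results file above can be sketched in Python as follows. This is a minimal illustration, not the pipeline's actual implementation: the function name `composite`, the constant name, and the clamping of normalized speed to the range 0 to 1 are assumptions. The source does not say how `latency_score` is derived from latency in milliseconds, so it is taken here as an already-computed input.

```python
# Hypothetical sketch of the composite score; names are illustrative only.
SPEED_CEILING_TOK_PER_SEC = 40.0  # "hardware-observed max" from the results file


def composite(quality: float, tokens_per_sec: float, latency_score: float) -> float:
    """Weighted sum: quality 0.45, normalized speed 0.30, latency score 0.25."""
    # Assumption: speed is clamped so a model faster than the ceiling cannot
    # score above 1.0 on the speed term.
    speed_normalized = min(max(tokens_per_sec / SPEED_CEILING_TOK_PER_SEC, 0.0), 1.0)
    return quality * 0.45 + speed_normalized * 0.30 + latency_score * 0.25


# qwen2.5-coder:7b in benchmark_20260308T215747.md: coding quality 0.823,
# 0.1 tok/sec, and (assumed) a latency score of 0 for the 9999 ms reading.
print(round(composite(0.823, 0.1, 0.0), 3))  # 0.371
```

Under the assumption that a 9999 ms latency maps to a latency score of 0, the result agrees with the 0.371 coding composite reported for that model, which suggests the quality term is the only one contributing meaningfully in these runs.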

+ 57 - 0
benchmarks/results/benchmark_20260308T215747.md

@@ -0,0 +1,57 @@
+# Benchmark Results - 20260308T215747
+
+## Model Selection (6-slot / 2-socket)
+
+
+| Slot | Socket              | Role             | Model               | Composite Score |
+| ---- | ------------------- | ---------------- | ------------------- | --------------- |
+| 1    | Node 1 (port 11434) | General (locked) | llama3.2:3b         | 0.45            |
+| 2    | Node 1 (port 11434) | General (locked) | mistral-nemo:latest | 0.45            |
+| 5    | Node 1 (port 11434) | General (rotate) | none                | N/A             |
+| 3    | Node 0 (port 11435) | Coding (locked)  | qwen2.5-coder:7b    | 0.371           |
+| 4    | Node 0 (port 11435) | Coding (locked)  | qwen2.5-coder:7b    | 0.371           |
+| 6    | Node 0 (port 11435) | Coding (rotate)  | none                | N/A             |
+
+
+## Detailed Metrics
+
+### llama3.2:3b
+
+- **Category**: general
+- **Coding Quality**: 0.917
+- **General Quality**: 1.0
+- **Avg Tokens/sec**: 0.1
+- **Latency (ms)**: 9999
+- **Coding Composite**: 0.413
+- **General Composite**: 0.45
+
+### qwen2.5-coder:7b
+
+- **Category**: coding
+- **Coding Quality**: 0.823
+- **General Quality**: 0.85
+- **Avg Tokens/sec**: 0.1
+- **Latency (ms)**: 9999
+- **Coding Composite**: 0.371
+- **General Composite**: 0.383
+
+### mistral-nemo:latest
+
+- **Category**: general
+- **Coding Quality**: 0.85
+- **General Quality**: 1.0
+- **Avg Tokens/sec**: 0.1
+- **Latency (ms)**: 9999
+- **Coding Composite**: 0.383
+- **General Composite**: 0.45
+
+## Scoring Formula
+
+- Composite = quality * 0.45 + token_speed_normalized * 0.30 + latency_score * 0.25
+- Speed normalized against 40 tok/sec ceiling (hardware-observed max)
+- Coding quality (per-prompt):
+code_gen: has_def×0.20 + has_return×0.20 + has_docstring×0.15 + has_type_hint×0.15 + has_code_block×0.10 + has_assert×0.08 + has_test_def×0.07 + has_import×0.05
+debug:    has_def×0.30 + has_return×0.30 + has_code_block×0.25 + has_assert×0.15
+refactor: has_def×0.25 + has_return×0.25 + has_code_block×0.20 + has_type_hint×0.15 + has_import×0.15
+- Category: override dict → quality delta (coding_avg - general_avg >= 0.1) → name pattern (coder/codestral/codellama/starcoder) → general
+
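
The per-prompt coding quality weights and the category fallback chain listed under "Scoring Formula" can be sketched as below. This is an illustration under stated assumptions: the function names, the `overrides` argument, and the case-insensitive substring match on the model name are hypothetical, and the detection of each `has_*` feature is not described in the source, so features are passed in as booleans.

```python
from typing import Mapping, Optional

# Weights copied from the "Scoring Formula" section; each set sums to 1.0.
PROMPT_WEIGHTS = {
    "code_gen": {
        "has_def": 0.20, "has_return": 0.20, "has_docstring": 0.15,
        "has_type_hint": 0.15, "has_code_block": 0.10, "has_assert": 0.08,
        "has_test_def": 0.07, "has_import": 0.05,
    },
    "debug": {
        "has_def": 0.30, "has_return": 0.30, "has_code_block": 0.25,
        "has_assert": 0.15,
    },
    "refactor": {
        "has_def": 0.25, "has_return": 0.25, "has_code_block": 0.20,
        "has_type_hint": 0.15, "has_import": 0.15,
    },
}

CODING_NAME_PATTERNS = ("coder", "codestral", "codellama", "starcoder")


def coding_quality(prompt_type: str, features: Mapping[str, bool]) -> float:
    """Sum the weights of the features that are present in a response."""
    weights = PROMPT_WEIGHTS[prompt_type]
    return sum(w for name, w in weights.items() if features.get(name, False))


def resolve_category(
    model: str,
    coding_avg: float,
    general_avg: float,
    overrides: Optional[Mapping[str, str]] = None,
) -> str:
    """Apply the chain: override dict, quality delta, name pattern, general."""
    if overrides and model in overrides:
        return overrides[model]
    if coding_avg - general_avg >= 0.1:
        return "coding"
    # Assumption: the name pattern is a case-insensitive substring match.
    if any(pattern in model.lower() for pattern in CODING_NAME_PATTERNS):
        return "coding"
    return "general"


print(resolve_category("qwen2.5-coder:7b", 0.823, 0.85))  # coding
print(resolve_category("llama3.2:3b", 0.917, 1.0))        # general
```

With the quality values from this run, `qwen2.5-coder:7b` falls through the quality-delta check (its coding average is below its general average) and is classified as coding by name pattern alone, which matches the category shown in the detailed metrics.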