
# Benchmark Results - 20260307T170059

## Model Selection

| Slot | Role                | Model             | Composite Score |
|------|---------------------|-------------------|-----------------|
| 1    | General (Primary)   | llama3.2:3b       | 0.967           |
| 2    | General (Secondary) | llama3.2:3b       | 0.967           |
| 3    | Coding (Primary)    | deepseek-coder-v2 | 0.738           |
| 4    | Coding (Secondary)  | qwen2.5-coder:7b  | 0.63            |

## Detailed Metrics

### deepseek-coder-v2

- Category: coding
- Coding Quality: 0.667
- General Quality: 0.918
- Avg Tokens/sec: 20.2
- Latency (ms): 1744.5
- Coding Composite: 0.738
- General Composite: 0.852

### qwen2.5-coder:7b

- Category: coding
- Coding Quality: 0.64
- General Quality: 0.922
- Avg Tokens/sec: 11.2
- Latency (ms): 1211.5
- Coding Composite: 0.63
- General Composite: 0.757

### llama3.2:3b

- Category: general
- Coding Quality: 0.607
- General Quality: 0.991
- Avg Tokens/sec: 22.5
- Latency (ms): 576.1
- Coding Composite: 0.794
- General Composite: 0.967

## Scoring Formula

- Composite = quality * 0.45 + token_speed_normalized * 0.30 + latency_score * 0.25
- Speed is normalized against a 22 tok/sec ceiling (hardware-observed max)
- Coding quality (weights sum to 1.0): has_def×0.20 + has_return×0.20 + has_docstring×0.15 + has_type_hint×0.15 + has_code_block×0.10 + has_assert×0.08 + has_test_def×0.07 + has_import×0.05
- Category is assigned by precedence: override dict first; then quality delta (coding_avg - general_avg >= 0.1 → coding); then name pattern (coder/codestral/codellama/starcoder → coding); otherwise general