Benchmark Results - 20260307T184212

Model Selection

Slot  Role                 Model                Composite Score
1     General (Primary)    llama3.2:3b          0.001
2     General (Secondary)  gemma-family:latest  0.0
3     Coding (Primary)     coder-128k:latest    0.001
4     Coding (Secondary)   coder-32k:latest     0.001

Detailed Metrics

gemma-family:latest

  • Category: general
  • Coding Quality: 0
  • General Quality: 0
  • Avg Tokens/sec: 0.0
  • Latency (ms): 9999
  • Coding Composite: 0.0
  • General Composite: 0.0

llama-family:latest

  • Category: general
  • Coding Quality: 0
  • General Quality: 0
  • Avg Tokens/sec: 0.0
  • Latency (ms): 9999
  • Coding Composite: 0.0
  • General Composite: 0.0

coder-128k:latest

  • Category: coding
  • Coding Quality: 0
  • General Quality: 0
  • Avg Tokens/sec: 0.0
  • Latency (ms): 285394.5
  • Coding Composite: 0.001
  • General Composite: 0.001

coder-32k:latest

  • Category: coding
  • Coding Quality: 0
  • General Quality: 0
  • Avg Tokens/sec: 0.1
  • Latency (ms): 142328.6
  • Coding Composite: 0.001
  • General Composite: 0.001

llama3.1:8b

  • Category: general
  • Coding Quality: 0
  • General Quality: 0
  • Avg Tokens/sec: 0.0
  • Latency (ms): 9999
  • Coding Composite: 0.0
  • General Composite: 0.0

deepseek-coder-v2:latest

  • Category: coding
  • Coding Quality: 0
  • General Quality: 0
  • Avg Tokens/sec: 0.0
  • Latency (ms): 9999
  • Coding Composite: 0.0
  • General Composite: 0.0

qwen2.5-coder:7b

  • Category: coding
  • Coding Quality: 0
  • General Quality: 0
  • Avg Tokens/sec: 0.1
  • Latency (ms): 143942.9
  • Coding Composite: 0.001
  • General Composite: 0.001

gemma3:12b-it-q4_K_M

  • Category: general
  • Coding Quality: 0
  • General Quality: 0
  • Avg Tokens/sec: 0.0
  • Latency (ms): 9999
  • Coding Composite: 0.0
  • General Composite: 0.0

llama3.2:3b

  • Category: general
  • Coding Quality: 0
  • General Quality: 0
  • Avg Tokens/sec: 0.1
  • Latency (ms): 139756.5
  • Coding Composite: 0.001
  • General Composite: 0.001

Scoring Formula

  • Composite = quality * 0.45 + token_speed_normalized * 0.30 + latency_score * 0.25
  • Speed normalized against 22 tok/sec ceiling (hardware-observed max)
  • Coding quality (per-prompt), by prompt type:
      code_gen: has_def×0.20 + has_return×0.20 + has_docstring×0.15 + has_type_hint×0.15 + has_code_block×0.10 + has_assert×0.08 + has_test_def×0.07 + has_import×0.05
      debug: has_def×0.30 + has_return×0.30 + has_code_block×0.25 + has_assert×0.15
      refactor: has_def×0.25 + has_return×0.25 + has_code_block×0.20 + has_type_hint×0.15 + has_import×0.15
  • Category (checked in order, first match wins): override dict → quality delta (coding if coding_avg - general_avg >= 0.1) → name pattern (coding if the model name matches coder/codestral/codellama/starcoder) → otherwise general
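
The scoring rules above can be sketched in Python as follows. This is a minimal illustration, not the benchmark's actual code: all function and constant names are hypothetical, clamping the normalized speed to [0, 1] is an assumption, name patterns are assumed to be case-insensitive substring matches, and latency_score is taken as an already-computed input because the report does not say how it is derived from latency.

```python
from typing import Mapping, Optional

# Ceiling taken from the report ("22 tok/sec ceiling").
SPEED_CEILING_TOK_PER_SEC = 22.0

# code_gen weights copied from the report; they sum to 1.0.
CODE_GEN_WEIGHTS = {
    "has_def": 0.20, "has_return": 0.20, "has_docstring": 0.15,
    "has_type_hint": 0.15, "has_code_block": 0.10, "has_assert": 0.08,
    "has_test_def": 0.07, "has_import": 0.05,
}

# Name patterns copied from the report's category rule.
NAME_PATTERNS = ("coder", "codestral", "codellama", "starcoder")


def normalized_speed(tokens_per_sec: float) -> float:
    """Scale tokens/sec against the ceiling (clamping is an assumption)."""
    return max(0.0, min(tokens_per_sec / SPEED_CEILING_TOK_PER_SEC, 1.0))


def composite(quality: float, tokens_per_sec: float, latency_score: float) -> float:
    """Composite = quality*0.45 + speed_normalized*0.30 + latency_score*0.25."""
    return (quality * 0.45
            + normalized_speed(tokens_per_sec) * 0.30
            + latency_score * 0.25)


def weighted_quality(checks: Mapping[str, bool], weights: Mapping[str, float]) -> float:
    """Sum the weights of every check that passed for one prompt."""
    return sum(w for name, w in weights.items() if checks.get(name, False))


def resolve_category(model: str, coding_avg: float, general_avg: float,
                     overrides: Optional[Mapping[str, str]] = None) -> str:
    """Apply the category rules in order: override, quality delta, name, default."""
    if overrides and model in overrides:
        return overrides[model]
    if coding_avg - general_avg >= 0.1:
        return "coding"
    if any(pattern in model.lower() for pattern in NAME_PATTERNS):
        return "coding"
    return "general"


if __name__ == "__main__":
    # Quality 0 and 0.1 tok/sec, with latency_score assumed to be 0.
    print(round(composite(0.0, 0.1, 0.0), 3))
    print(resolve_category("qwen2.5-coder:7b", 0.0, 0.0))
```

Under these assumptions, a model with zero quality and 0.1 tok/sec gets its whole composite from the speed term (0.1 / 22 × 0.30), which rounds to the 0.001 reported for the models that returned any tokens.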