Role: benchmark

Purpose

Benchmark all installed Ollama models to determine optimal slot assignments. Runs coding and general-purpose test suites, scores each model, and writes results to the benchmark report and model_selection.json.

Test Details

| Test | Category | Prompt | Scoring Method |
|------|----------|--------|----------------|
| `code_gen` | coding | "Write a Python function that implements binary search on a sorted list. Include type hints and docstring." | `def` + `return` present, code structure |
| `debug` | coding | "Find and fix the bug: `def factorial(n): return n * factorial(n)`. Explain the issue." | Base case identified, explanation quality |
| `refactor` | coding | "Refactor to list comprehension: `result = []; for i in range(10): if i % 2 == 0: result.append(i*i)`" | List comprehension present, conciseness |
| `explain` | general | "Explain recursion to a beginner programmer. Use a simple analogy." | Clarity, analogy present, length adequate |
| `creative` | general | "Write a short poem about artificial intelligence." | Line count, poetic structure |
| `reasoning` | general | "A farmer has 17 sheep. All but 9 die. How many are left? Explain step by step." | Correct answer (9), reasoning steps |
| `latency` | latency | "Hi" | Time to first token (TTFT) |
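
The `latency` test reports time to first token. A minimal sketch of one way to measure it, assuming the model's reply is available as an iterable of text chunks (the `measure_ttft_ms` name and the `chunks` argument are hypothetical and not taken from this role):

```python
import time
from typing import Iterable, Tuple


def measure_ttft_ms(chunks: Iterable[str]) -> Tuple[float, str]:
    """Return (time to first chunk in ms, full response text).

    `chunks` is assumed to be a lazy stream: the clock starts before the
    first chunk is requested and stops as soon as it arrives.
    """
    start = time.perf_counter()
    iterator = iter(chunks)
    first = next(iterator, None)
    if first is None:
        raise ValueError("stream produced no output")
    ttft_ms = (time.perf_counter() - start) * 1000.0
    # Drain the rest of the stream so the caller also gets the full text.
    text = first + "".join(iterator)
    return ttft_ms, text
```

Any streaming client that yields text incrementally could be passed in as `chunks`; the timing is only meaningful if the stream is not buffered beforehand.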

Quality Heuristics

Each test type uses specific checks to score quality (0.0 to 1.0):

`code_gen` -- regex checks for `def`, `return`, type hint patterns (`:`), docstring (`"""`); score based on how many are present
`debug` -- checks for mention of base case, `if n <= 1` or similar fix, explanation length
`refactor` -- checks for `[` list comprehension syntax, absence of `for`/`append` loop pattern, output length relative to input
`explain` -- checks response length (>100 chars), presence of analogy keywords ("like", "imagine", "similar"), paragraph count
`creative` -- checks line count (>=4), presence of line breaks, absence of purely prose output
`reasoning` -- checks for "9" in response, presence of step indicators ("step", "first", "because", numbered lists)
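
As an illustration of how such checks could be expressed, here is a rough sketch of the `code_gen` and `reasoning` heuristics. The regular expressions, the equal weighting of checks, and the function names are assumptions made for this example; the role's actual patterns and weights may differ.

```python
import re

# Hypothetical patterns for the four code_gen checks described above.
CODE_GEN_PATTERNS = {
    "def": r"\bdef\s+\w+\s*\(",
    "return": r"\breturn\b",
    "type_hint": r"\w\s*:\s*\w|->",  # annotated parameter or return annotation
    "docstring": r'"""',
}

# Step indicators: keywords, or a line that starts like a numbered list item.
STEP_PATTERN = re.compile(r"\b(step|first|because)\b|^\s*\d+[.)]\s",
                          re.IGNORECASE | re.MULTILINE)


def score_code_gen(response: str) -> float:
    """Fraction of the code_gen checks that match (0.0 to 1.0)."""
    hits = sum(1 for p in CODE_GEN_PATTERNS.values() if re.search(p, response))
    return hits / len(CODE_GEN_PATTERNS)


def score_reasoning(response: str) -> float:
    """Half credit for the answer 9, half for visible reasoning steps."""
    has_answer = re.search(r"\b9\b", response) is not None
    has_steps = STEP_PATTERN.search(response) is not None
    return 0.5 * has_answer + 0.5 * has_steps
```

Heuristics of this kind only check surface features; a response can match every pattern and still be wrong, which is worth keeping in mind when reading the quality scores.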

Scoring Formula

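composite = (quality * 0.45) + (tokens_per_sec_normalized * 0.30) + (latency_score * 0.25)

Here tokens_per_sec_normalized is the model's tokens/sec divided by that of the fastest model, and latency_score is 1.0 minus the model's TTFT divided by the TTFT of the slowest model.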

Example Calculation

For a model with quality=0.8, tokens/sec=38.2 (fastest=55.8), TTFT=420ms (slowest=510ms):

tokens_per_sec_normalized = 38.2 / 55.8 = 0.685
latency_score = 1.0 - (420 / 510) = 0.176

composite = (0.8 * 0.45) + (0.685 * 0.30) + (0.176 * 0.25)
          = 0.360 + 0.206 + 0.044
          = 0.610
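
The same calculation can be sketched in Python. The function name and argument names below are illustrative; the default weights are the values listed for `benchmark_quality_weight`, `benchmark_speed_weight`, and `benchmark_latency_weight`.

```python
def composite_score(
    quality: float,
    tokens_per_sec: float,
    fastest_tokens_per_sec: float,
    ttft_ms: float,
    slowest_ttft_ms: float,
    quality_weight: float = 0.45,
    speed_weight: float = 0.30,
    latency_weight: float = 0.25,
) -> float:
    """Weighted composite of quality, normalized speed, and latency score."""
    if fastest_tokens_per_sec <= 0 or slowest_ttft_ms <= 0:
        raise ValueError("reference values must be positive")
    # Speed relative to the fastest model in the run (1.0 = fastest).
    speed = tokens_per_sec / fastest_tokens_per_sec
    # Latency relative to the slowest model (0.0 = slowest).
    latency = 1.0 - (ttft_ms / slowest_ttft_ms)
    return (quality * quality_weight
            + speed * speed_weight
            + latency * latency_weight)


score = composite_score(0.8, 38.2, 55.8, 420, 510)
print(f"{score:.4f}")
```

Without intermediate rounding the result is about 0.6095; the 0.610 in the worked example comes from rounding each term to three decimals before summing. Note that under this formula the slowest model in a run always receives a latency score of 0.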

Configuration

All parameters are configurable via group_vars/all.yml:

| Key | Default | Description |
|-----|---------|-------------|
| `benchmark_min_tokens_per_sec` | 10 | Minimum tokens/sec to pass a model |
| `benchmark_max_ttft_ms` | 5000 | Maximum acceptable time to first token (ms) |
| `benchmark_quality_weight` | 0.45 | Weight of quality score in composite |
| `benchmark_speed_weight` | 0.30 | Weight of normalized tokens/sec in composite |
| `benchmark_latency_weight` | 0.25 | Weight of latency score in composite |
| `benchmark_coding_threshold` | 0.15 | Min coding-general delta for coding classification |
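
To make the threshold keys concrete, the sketch below shows one plausible reading of them. The function names, the inclusive comparisons, and the `"general"` fallback label are assumptions for illustration and are not confirmed by this README.

```python
def passes_thresholds(
    tokens_per_sec: float,
    ttft_ms: float,
    min_tokens_per_sec: float = 10,   # benchmark_min_tokens_per_sec
    max_ttft_ms: float = 5000,        # benchmark_max_ttft_ms
) -> bool:
    """True if the model is fast enough on both throughput and latency."""
    return tokens_per_sec >= min_tokens_per_sec and ttft_ms <= max_ttft_ms


def classify(
    coding_composite: float,
    general_composite: float,
    coding_threshold: float = 0.15,   # benchmark_coding_threshold
) -> str:
    """Label a model 'coding' when its coding score leads by the threshold."""
    delta = coding_composite - general_composite
    # Assumption: anything below the threshold is treated as 'general'.
    return "coding" if delta >= coding_threshold else "general"
```

With the composites shown for `qwen2.5-coder:14b` in the output examples (0.82 coding, 0.65 general), the delta is 0.17, which clears the default threshold of 0.15.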

Candidate Models

| Model | Size | Expected Speed | Reasoning |
|-------|------|----------------|-----------|
| `qwen2.5-coder:14b` | 14B | ~35-40 tok/s | Strong coding performance at moderate size |
| `deepseek-coder-v2:16b` | 16B | ~30-38 tok/s | Competitive coding with broad language support |
| `llama3.1:8b` | 8B | ~50-55 tok/s | Fast general-purpose model |
| `mistral:7b` | 7B | ~50-58 tok/s | Fast general-purpose, good reasoning |

Output Files

Benchmark Report

Written to benchmarks/benchmark_<timestamp>.md:

| Model                  | Coding Composite | General Composite | Classification | Tokens/sec | TTFT (ms) |
|------------------------|------------------|-------------------|----------------|------------|-----------|
| qwen2.5-coder:14b      | 0.82             | 0.65              | coding         | 38.2       | 420       |
| ...                    | ...              | ...               | ...            | ...        | ...       |

Model Selection

Written to model_selection.json:

{
  "timestamp": "2025-01-15T10:30:00Z",
  "slot1_coding": "qwen2.5-coder:14b",
  "slot2_general": "llama3.1:8b",
  "slot3_backup": "deepseek-coder-v2:16b",
  "slot4_experimental": null,
  "results": {
    "qwen2.5-coder:14b": {
      "coding_composite": 0.82,
      "general_composite": 0.65,
      "classification": "coding",
      "tokens_per_sec": 38.2,
      "ttft_ms": 420
    }
  }
}
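
A consumer of this file might read the slot assignments as follows. This is a small sketch that assumes the key layout shown above (top-level keys prefixed with `slot`, with `null` for unassigned slots); the helper names are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict


def assigned_slots(selection: dict) -> Dict[str, str]:
    """Return only the slots that have a model assigned (null slots dropped)."""
    return {
        key: value
        for key, value in selection.items()
        if key.startswith("slot") and value is not None
    }


def load_assigned_slots(path: str = "model_selection.json") -> Dict[str, str]:
    """Read the selection file from disk and return its assigned slots."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return assigned_slots(data)
```

For the example file above, `assigned_slots` would return the three populated slots and omit `slot4_experimental`.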
