Dynamic benchmark system for all installed Ollama models. Runs a suite of coding and general-purpose tests against every model currently available on the Ollama server, scores each model on a composite metric, and assigns models to the 4-slot system based on results.
Benchmark all installed models:
ansible-playbook playbooks/05_benchmark.yml
Benchmark specific models only:
ansible-playbook playbooks/05_benchmark.yml -e '{"benchmark_specific_models":["qwen2.5-coder:14b","deepseek-coder-v2:16b"]}'
Benchmark with automatic model pulling if a better model is found:
ansible-playbook playbooks/05_benchmark.yml -e pull_if_better=true
| Test | Prompt | What Is Measured |
|---|---|---|
| code_gen | "Write a Python function that implements binary search on a sorted list. Include type hints and docstring." | Correctness (`def` + `return` present), code structure, tokens/sec |
| debug | "Find and fix the bug in this Python code: `def factorial(n): return n * factorial(n)`. Explain the issue." | Identifies base case bug, explanation quality, tokens/sec |
| refactor | "Refactor this code to use list comprehension: `result = []; for i in range(10): if i % 2 == 0: result.append(i*i)`" | Produces list comprehension, conciseness, tokens/sec |
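The heuristic quality checks listed above (e.g. "`def` + `return` present") can be sketched as simple string checks. This is an illustrative sketch only; the function names and exact checks here are assumptions, not the playbook's actual implementation.

```python
# Illustrative sketch of per-test quality heuristics (assumed names/checks).

def score_code_gen(response: str) -> float:
    """code_gen: credit for a function definition and a return statement."""
    checks = ["def " in response, "return" in response]
    return sum(checks) / len(checks)

def score_refactor(response: str) -> float:
    """refactor: credit for producing a list comprehension."""
    return 1.0 if ("[" in response and "for" in response and "]" in response) else 0.0
```

Each check maps a model response to a quality value in 0.0-1.0, which feeds into the composite score described below.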
| Test | Prompt | What Is Measured |
|---|---|---|
| explain | "Explain the concept of recursion to a beginner programmer. Use a simple analogy." | Clarity, analogy presence, length adequacy, tokens/sec |
| creative | "Write a short poem about artificial intelligence." | Creativity (line count, poetic structure), tokens/sec |
| reasoning | "A farmer has 17 sheep. All but 9 die. How many are left? Explain your reasoning step by step." | Correct answer (9), step-by-step reasoning, tokens/sec |
| Test | Prompt | What Is Measured |
|---|---|---|
| latency | "Hi" | Time to first token (TTFT), measured from the /api/generate response |

For each category (coding, general), a composite score is calculated:
composite = (quality * 0.45) + (tokens_per_sec_normalized * 0.30) + (latency_score * 0.25)
Where:

- `quality` is 0.0-1.0, based on heuristic checks for the test type
- `tokens_per_sec_normalized` is the model's tokens/sec divided by the fastest model's tokens/sec
- `latency_score` is 1.0 - (model_ttft / slowest_ttft)

A model is classified as a coding model if:
coding_composite - general_composite >= 0.15
Otherwise it is classified as general.
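The scoring and classification rules above can be sketched as follows. The weights and threshold come from the document; the function signatures and data layout are illustrative assumptions.

```python
# Sketch of the composite score and coding/general classification.
# Weights mirror the documented defaults; helper names are assumptions.

QUALITY_WEIGHT = 0.45    # benchmark_quality_weight
SPEED_WEIGHT = 0.30      # benchmark_speed_weight
LATENCY_WEIGHT = 0.25    # benchmark_latency_weight
CODING_THRESHOLD = 0.15  # benchmark_coding_threshold

def composite(quality, tps, fastest_tps, ttft, slowest_ttft):
    """quality in [0, 1]; tokens/sec normalized against the fastest model;
    latency_score = 1.0 - (model_ttft / slowest_ttft)."""
    tps_norm = tps / fastest_tps if fastest_tps else 0.0
    latency_score = 1.0 - (ttft / slowest_ttft) if slowest_ttft else 0.0
    return (quality * QUALITY_WEIGHT
            + tps_norm * SPEED_WEIGHT
            + latency_score * LATENCY_WEIGHT)

def classify(coding_composite, general_composite):
    """Coding model iff the coding-general delta meets the threshold."""
    delta = coding_composite - general_composite
    return "coding" if delta >= CODING_THRESHOLD else "general"
```

For example, with the table values below, `classify(0.82, 0.65)` yields `"coding"` (delta 0.17 ≥ 0.15), while `classify(0.61, 0.74)` yields `"general"`.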
All thresholds are configurable via group_vars/all.yml:

| Key | Default | Description |
|---|---|---|
| benchmark_min_tokens_per_sec | 10 | Minimum tokens/sec for a model to pass |
| benchmark_max_ttft_ms | 5000 | Maximum time to first token, in milliseconds |
| benchmark_quality_weight | 0.45 | Weight of the quality score in the composite |
| benchmark_speed_weight | 0.30 | Weight of tokens/sec in the composite |
| benchmark_latency_weight | 0.25 | Weight of the latency score in the composite |
| benchmark_coding_threshold | 0.15 | Minimum coding-general delta for the coding classification |
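An override might look like this in group_vars/all.yml. The keys are the ones documented above; the values here are illustrative, not recommendations.

```yaml
# group_vars/all.yml — benchmark threshold overrides (illustrative values)
benchmark_min_tokens_per_sec: 15
benchmark_max_ttft_ms: 3000
benchmark_quality_weight: 0.50
benchmark_speed_weight: 0.30
benchmark_latency_weight: 0.20
benchmark_coding_threshold: 0.15
```

Note that if you change the three weights, they should still sum to 1.0 so composite scores remain comparable across runs.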
Each run produces benchmarks/benchmark_<timestamp>.md with a results table:
| Model | Coding Composite | General Composite | Classification | Tokens/sec | TTFT (ms) |
|------------------------|------------------|-------------------|----------------|------------|-----------|
| qwen2.5-coder:14b | 0.82 | 0.65 | coding | 38.2 | 420 |
| deepseek-coder-v2:16b | 0.78 | 0.63 | coding | 35.1 | 510 |
| llama3.1:8b | 0.61 | 0.74 | general | 52.3 | 280 |
| mistral:7b | 0.58 | 0.71 | general | 55.8 | 250 |
Results are also written to model_selection.json:
{
"timestamp": "2025-01-15T10:30:00Z",
"slot1_coding": "qwen2.5-coder:14b",
"slot2_general": "llama3.1:8b",
"slot3_backup": "deepseek-coder-v2:16b",
"slot4_experimental": null,
"results": { ... }
}
Slots are assigned from benchmark results as follows:
- slot1_coding: the model with the highest coding_composite score
- slot2_general: the model with the highest general_composite score
- slot3_backup: the runner-up coding model (see deepseek-coder-v2:16b in the example output above)
- slot4_experimental: unset unless specified manually with -e slot4_model=<name>
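The slot-assignment rule can be sketched as below. The `results` layout, the exclusion of slot1 from the general ranking, and the runner-up rule for slot3 are assumptions inferred from the example output, not the playbook's confirmed logic.

```python
# Sketch of slot assignment from composite scores (assumed data layout).

def assign_slots(results, experimental=None):
    """results: model name -> {"coding": float, "general": float}."""
    by_coding = sorted(results, key=lambda m: results[m]["coding"], reverse=True)
    by_general = sorted(results, key=lambda m: results[m]["general"], reverse=True)
    slot1 = by_coding[0]                               # highest coding composite
    slot2 = next(m for m in by_general if m != slot1)  # highest general composite
    slot3 = next(m for m in by_coding if m not in (slot1, slot2))  # runner-up coder
    return {
        "slot1_coding": slot1,
        "slot2_general": slot2,
        "slot3_backup": slot3,
        "slot4_experimental": experimental,  # only set via -e slot4_model=<name>
    }
```

Applied to the composite scores from the results table above, this reproduces the model_selection.json example: qwen2.5-coder:14b in slot 1, llama3.1:8b in slot 2, deepseek-coder-v2:16b as backup.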