# Benchmarks

## Overview

Dynamic benchmark system for all installed Ollama models. Runs a suite of coding and
general-purpose tests against every model on the Ollama server, scores each model on a
composite metric, and assigns models to the 6-slot dual-socket system based on results.

Modelfile aliases (`coder-128k`, `coder-32k`, `coder-rotate`, `llama-family`,
`gemma-family`) are automatically excluded from benchmarking — they share weights with
real models and their large context window parameters would stall every run with
285-second KV-cache allocations.
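
The exclusion described above can be sketched in a few lines of Python. This is an illustration only, not the playbook's actual implementation: the function name `filter_benchmark_models` is hypothetical, and the tag handling (treating `coder-128k:latest` the same as `coder-128k`) is an assumption.

```python
# Hypothetical helper: drop Modelfile aliases before the benchmark loop.
SKIP_ALIASES = {"coder-128k", "coder-32k", "coder-rotate", "llama-family", "gemma-family"}


def filter_benchmark_models(installed, skip_aliases=SKIP_ALIASES):
    """Return the installed model names whose base name is not a skipped alias."""
    kept = []
    for name in installed:
        # Assumption: an alias may be listed with a tag such as ":latest",
        # so only the part before the first colon is compared.
        base = name.split(":", 1)[0]
        if base not in skip_aliases:
            kept.append(name)
    return kept


installed = ["llama3.1:8b", "coder-128k:latest", "qwen2.5-coder:14b", "gemma-family"]
print(filter_benchmark_models(installed))
```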

## How to Run

**Benchmark all installed models:**

```bash
ansible-playbook playbooks/03_benchmark.yml -K -e @local.yml
```

**Benchmark specific models only:**

```bash
ansible-playbook playbooks/03_benchmark.yml -K -e @local.yml \
  -e "benchmark_models=qwen2.5-coder:14b,deepseek-coder-v2:16b"
```

**Benchmark and immediately push 6-slot warm-up selections:**

```bash
ansible-playbook playbooks/03_benchmark.yml -K -e @local.yml && \
ansible-playbook playbooks/04_models.yml -K -e @local.yml
```

## Three-Pass Execution

Models are split into three size tiers before benchmarking. Each tier gets its own
per-request timeout to avoid small models waiting behind 70 B giants:

| Tier   | Estimated runtime RAM | Timeout | Description                      |
|--------|-----------------------|---------|----------------------------------|
| Small  | < 10 GB               | 300 s   | 7 B and under, fast path         |
| Medium | 10–15 GB              | 900 s   | 16 B lite / 12 B, standard wait  |
| Large  | > 15 GB               | 1200 s  | 34 B+, 20-minute ceiling         |

**Size source vs runtime RAM:** `ollama list` reports on-disk (compressed) sizes, which
are smaller than actual runtime RAM usage (model weights + KV cache + overhead). A
`benchmark_size_overhead_factor` (default `1.2`) bridges the gap: the tier boundaries,
which are expressed in runtime RAM, are divided by the factor to get disk-size cutoffs
before comparison. For example, with default settings the small cutoff is
10 / 1.2 ≈ 8.3 GB on disk, so a 9 GB on-disk model (treated as ~10.8 GB at runtime)
falls in the medium tier rather than small.
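
The tier assignment can be sketched as follows. This is a minimal illustration under stated assumptions, not the playbook's code: the function name `classify_tier` and the `TIER_TIMEOUTS` mapping are hypothetical, and treating a model that sits exactly on the medium cutoff as medium is an assumption.

```python
# Default timeouts per tier, taken from the table above (seconds).
TIER_TIMEOUTS = {"small": 300, "medium": 900, "large": 1200}


def classify_tier(disk_gb, small_max_gb=10.0, medium_max_gb=15.0, overhead_factor=1.2):
    """Map an on-disk size from `ollama list` to a benchmark tier."""
    # The boundaries are runtime-RAM figures, so divide them by the
    # overhead factor to get cutoffs comparable with on-disk sizes.
    small_cutoff = small_max_gb / overhead_factor
    medium_cutoff = medium_max_gb / overhead_factor
    if disk_gb < small_cutoff:
        return "small"
    if disk_gb <= medium_cutoff:  # assumption: the upper bound is inclusive
        return "medium"
    return "large"


for size in (4.7, 9.0, 40.0):
    tier = classify_tier(size)
    print(f"{size} GB on disk -> {tier} tier, {TIER_TIMEOUTS[tier]} s timeout")
```

With the defaults, the 9 GB model lands in the medium tier because the small cutoff on disk is roughly 8.3 GB.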

**Override tier boundaries:**

```bash
# Move the small/medium and medium/large boundaries
ansible-playbook playbooks/03_benchmark.yml -K -e @local.yml \
  -e "benchmark_small_max_gb=8 benchmark_medium_max_gb=20"

# Tune the overhead factor if your models load larger/smaller than expected
ansible-playbook playbooks/03_benchmark.yml -K -e @local.yml \
  -e "benchmark_size_overhead_factor=1.25"

# Override timeouts only
ansible-playbook playbooks/03_benchmark.yml -K -e @local.yml \
  -e "benchmark_medium_timeout=600 benchmark_large_timeout=1800"
```

## Test Suites

### Coding Tests

| Test       | Prompt                                                                     | What Is Measured                                   |
|------------|----------------------------------------------------------------------------|----------------------------------------------------|
| `code_gen` | Write a Python merge sort with type hints, docstring, and 3 unit tests     | `def`, `return`, `"""`, `->`, `assert`, `def test_`, `import` |
| `debug`    | Find and fix 3 bugs in a given Python function                             | `def`, `return`, code block, `assert`              |
| `refactor` | Refactor a loop for readability and performance                            | `def`, `return`, code block, type hint, `import`   |

### General Tests

| Test        | Prompt                                                     | What Is Measured                                     |
|-------------|------------------------------------------------------------|------------------------------------------------------|
| `explain`   | Explain how Python's GIL works and when it matters         | Response length, paragraph structure, list formatting |
| `creative`  | Suggest 5 fun family activities for a rainy weekend        | Response length, paragraph structure, list formatting |
| `reasoning` | Apple arithmetic word problem                              | Response length, paragraph structure, list formatting |

### Latency Test

| Test      | Prompt | What Is Measured                                   |
|-----------|--------|----------------------------------------------------|
| `latency` | "Hi"   | Total response time (eval + prompt eval), used as TTFT proxy |

## Scoring

### Composite Score Formula

For each category (coding, general), a composite score is calculated:

```
composite = quality * 0.45
          + min(tokens_per_sec / ceiling, 1.0) * 0.30
          + max(1 - ttft_ms / 5000, 0.0) * 0.25
```

Where:
- `quality` — 0.0–1.0 from heuristic checks per test type (see CLAUDE.md for weights)
- `tokens_per_sec` — averaged across all test responses; normalized against `benchmark_toks_norm_ceiling` (default 40)
- `ttft_ms` — latency test response time in milliseconds
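
As a worked example, the formula translates directly into Python. The function name `composite_score` and the sample inputs are illustrative and not taken from the playbook.

```python
def composite_score(quality, tokens_per_sec, ttft_ms, toks_norm_ceiling=40.0):
    """Combine quality, throughput, and latency into one 0.0-1.0 score."""
    # Throughput is normalized against the ceiling and capped at 1.0.
    speed = min(tokens_per_sec / toks_norm_ceiling, 1.0)
    # Latency is scored against a 5000 ms budget and floored at 0.0.
    latency = max(1.0 - ttft_ms / 5000.0, 0.0)
    return quality * 0.45 + speed * 0.30 + latency * 0.25


# quality 0.8, 20 tok/sec (half the ceiling), 1000 ms latency:
# 0.8 * 0.45 + 0.5 * 0.30 + 0.8 * 0.25 = 0.36 + 0.15 + 0.20
print(round(composite_score(0.8, 20.0, 1000.0), 2))
```

Note that quality carries the largest weight, so a fast model with poor output cannot score above 0.55.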

### Classification Rule

A model is classified as **coding** if:

```
coding_composite - general_composite >= benchmark_coding_threshold   # default 0.10
```

Name-pattern heuristics (`coder`, `codestral`, `codellama`, `starcoder`) apply as a
tiebreaker. Category can also be forced with `model_category_overrides` in
`inventory/group_vars/all.yml`.
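
One possible reading of this rule is sketched below. Several points are assumptions rather than documented behavior: the function name `classify_category` is hypothetical, overrides are modeled as a plain name-to-category mapping, and a "tie" is interpreted as a delta that falls strictly between `-threshold` and `+threshold`.

```python
CODING_NAME_PATTERNS = ("coder", "codestral", "codellama", "starcoder")


def classify_category(name, coding_composite, general_composite,
                      threshold=0.10, overrides=None):
    """Return "coding" or "general" for a benchmarked model."""
    overrides = overrides or {}
    # A forced category wins over anything measured.
    if name in overrides:
        return overrides[name]
    delta = coding_composite - general_composite
    if delta >= threshold:
        return "coding"
    if delta <= -threshold:  # assumption: a clear general lead ignores the name
        return "general"
    # Assumed tie zone: fall back to the name-pattern heuristic.
    if any(pattern in name.lower() for pattern in CODING_NAME_PATTERNS):
        return "coding"
    return "general"


print(classify_category("mistral:latest", 0.85, 0.60))    # clear coding lead
print(classify_category("qwen2.5-coder:7b", 0.70, 0.68))  # tie, decided by name
print(classify_category("llama3.1:8b", 0.70, 0.68))       # tie, no pattern match
```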

## Thresholds and Configuration

All thresholds are configurable in `inventory/group_vars/all.yml`:

| Key                                        | Default   | Description                                                            |
|--------------------------------------------|-----------|------------------------------------------------------------------------|
| `benchmark_thresholds.min_tokens_per_sec`  | 5.0       | Minimum tok/sec to be slot-eligible                                    |
| `benchmark_thresholds.min_quality_score`   | 0.6       | Minimum quality score to be slot-eligible                              |
| `benchmark_thresholds.min_composite_score` | 0.55      | Minimum composite to avoid threshold warning                           |
| `benchmark_toks_norm_ceiling`              | 40        | tok/sec ceiling for normalization (dual-socket target)                 |
| `benchmark_coding_threshold`               | 0.10      | coding-general composite delta for classification                      |
| `benchmark_small_max_gb`                   | 10        | Runtime RAM upper bound for small pass (GB)                            |
| `benchmark_medium_max_gb`                  | 15        | Runtime RAM upper bound for medium pass (GB)                           |
| `benchmark_size_overhead_factor`           | 1.2       | Multiplier applied to `ollama list` disk sizes to estimate runtime RAM |
| `benchmark_small_timeout`                  | 300       | Per-request timeout for small models (seconds)                         |
| `benchmark_medium_timeout`                 | 900       | Per-request timeout for medium models (seconds)                        |
| `benchmark_large_timeout`                  | 1200      | Per-request timeout for large models (seconds)                         |
| `benchmark_skip_aliases`                   | see below | Modelfile aliases excluded from benchmark loop                         |

Default `benchmark_skip_aliases`:
```yaml
- coder-128k
- coder-32k
- coder-rotate
- llama-family
- gemma-family
```
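
The three `benchmark_thresholds` values from the table could be applied along these lines. This is a sketch rather than the playbook's logic: the function name `check_thresholds` is hypothetical, and treating a value exactly equal to a minimum as passing is an assumption.

```python
# Defaults from the configuration table.
THRESHOLDS = {
    "min_tokens_per_sec": 5.0,
    "min_quality_score": 0.6,
    "min_composite_score": 0.55,
}


def check_thresholds(tokens_per_sec, quality, composite, thresholds=THRESHOLDS):
    """Return (slot_eligible, composite_warning) for one model."""
    # Throughput and quality gate slot eligibility.
    eligible = (
        tokens_per_sec >= thresholds["min_tokens_per_sec"]
        and quality >= thresholds["min_quality_score"]
    )
    # A low composite only raises a warning; it does not affect eligibility here.
    warning = composite < thresholds["min_composite_score"]
    return eligible, warning


print(check_thresholds(tokens_per_sec=12.0, quality=0.8, composite=0.71))
print(check_thresholds(tokens_per_sec=3.5, quality=0.8, composite=0.50))
```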

## Output Format

### Benchmark Report

Each run produces `benchmarks/results/benchmark_<timestamp>.md`. The slot table covers
all 6 slots across both NUMA instances:

```
| Slot | Socket              | Role            | Model                     | Composite |
|------|---------------------|-----------------|---------------------------|-----------|
| 1    | Node 1 (port 11434) | General (locked)| llama3.1:8b               | 0.74      |
| 2    | Node 1 (port 11434) | General (locked)| mistral:latest            | 0.71      |
| 5    | Node 1 (port 11434) | General (rotate)| llama3.2:3b               | 0.63      |
| 3    | Node 0 (port 11435) | Coding (locked) | deepseek-coder-v2:16b     | 0.82      |
| 4    | Node 0 (port 11435) | Coding (locked) | qwen2.5-coder:7b          | 0.78      |
| 6    | Node 0 (port 11435) | Coding (rotate) | codegemma:7b              | 0.69      |
```

### model_selection.json

Results are written to `benchmarks/results/model_selection.json`:

```json
{
  "slot1_general": "llama3.1:8b",
  "slot2_general": "mistral:latest",
  "slot5_general_rotate": "llama3.2:3b",
  "slot3_coding": "deepseek-coder-v2:16b",
  "slot4_coding": "qwen2.5-coder:7b",
  "slot6_coding_rotate": "codegemma:7b",
  "general_ranking": [...],
  "coding_ranking": [...],
  "all_metrics": { ... }
}
```

This file is read by `04_models.yml` to decide what to pull and warm up. It is committed
to the repo so slot selections survive a clean checkout.
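
For scripts that consume this file, the slot keys can be parsed into a slot-number mapping. The sketch below is illustrative only: the helper name `slot_assignments` is hypothetical, and the inline sample repeats only the slot keys shown above, leaving out the ranking and metrics entries.

```python
import json
import re

# Inline sample so the sketch runs on its own; a real script would read
# benchmarks/results/model_selection.json instead.
SAMPLE = """
{
  "slot1_general": "llama3.1:8b",
  "slot2_general": "mistral:latest",
  "slot5_general_rotate": "llama3.2:3b",
  "slot3_coding": "deepseek-coder-v2:16b",
  "slot4_coding": "qwen2.5-coder:7b",
  "slot6_coding_rotate": "codegemma:7b"
}
"""


def slot_assignments(selection):
    """Map slot number -> (category, model), ignoring non-slot keys."""
    slots = {}
    for key, value in selection.items():
        match = re.match(r"slot(\d+)_(general|coding)", key)
        if match:
            slots[int(match.group(1))] = (match.group(2), value)
    return dict(sorted(slots.items()))


for slot, (category, model) in slot_assignments(json.loads(SAMPLE)).items():
    print(f"slot {slot}: {category:<7} {model}")
```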