# Role: benchmark ## Purpose Benchmark all installed Ollama models to determine optimal slot assignments. Runs coding and general-purpose test suites, scores each model, and writes results to the benchmark report and `model_selection.json`. ## Test Details | Test | Category | Prompt | Scoring Method | |-------------|----------|-------------------------------------------------------------------------------------------------|---------------------------| | `code_gen` | coding | "Write a Python function that implements binary search on a sorted list. Include type hints and docstring." | `def` + `return` present, code structure | | `debug` | coding | "Find and fix the bug: `def factorial(n): return n * factorial(n)`. Explain the issue." | Base case identified, explanation quality | | `refactor` | coding | "Refactor to list comprehension: `result = []; for i in range(10): if i % 2 == 0: result.append(i*i)`" | List comprehension present, conciseness | | `explain` | general | "Explain recursion to a beginner programmer. Use a simple analogy." | Clarity, analogy present, length adequate | | `creative` | general | "Write a short poem about artificial intelligence." | Line count, poetic structure | | `reasoning` | general | "A farmer has 17 sheep. All but 9 die. How many are left? Explain step by step." | Correct answer (9), reasoning steps | | `latency` | latency | "Hi" | Time to first token (TTFT) | ## Quality Heuristics Each test type uses specific checks to score quality (0.0 to 1.0): - **code_gen** -- regex checks for `def `, `return`, type hint patterns (`: `), docstring (`"""`); score based on how many are present - **debug** -- checks for mention of base case, `if n <= 1` or similar fix, explanation length - **refactor** -- checks for `[` list comprehension syntax, absence of `for`/`append` loop pattern, output length relative to input - **explain** -- checks response length (>100 chars), presence of analogy keywords ("like", "imagine", "similar"), paragraph count - **creative** -- checks line count (>=4), presence of line breaks, absence of purely prose output - **reasoning** -- checks for "9" in response, presence of step indicators ("step", "first", "because", numbered lists) ## Scoring Formula ``` composite = (quality * 0.45) + (tokens_per_sec_normalized * 0.30) + (latency_score * 0.25) ``` ### Example Calculation For a model with quality=0.8, tokens/sec=38.2 (fastest=55.8), TTFT=420ms (slowest=510ms): ``` tokens_per_sec_normalized = 38.2 / 55.8 = 0.685 latency_score = 1.0 - (420 / 510) = 0.176 composite = (0.8 * 0.45) + (0.685 * 0.30) + (0.176 * 0.25) = 0.360 + 0.206 + 0.044 = 0.610 ``` ## Configuration All parameters are configurable via `group_vars/all.yml`: | Key | Default | Description | |--------------------------------|---------|------------------------------------------------| | `benchmark_min_tokens_per_sec` | 10 | Minimum tokens/sec to pass a model | | `benchmark_max_ttft_ms` | 5000 | Maximum acceptable time to first token (ms) | | `benchmark_quality_weight` | 0.45 | Weight of quality score in composite | | `benchmark_speed_weight` | 0.30 | Weight of normalized tokens/sec in composite | | `benchmark_latency_weight` | 0.25 | Weight of latency score in composite | | `benchmark_coding_threshold` | 0.15 | Min coding-general delta for coding classification | ## Candidate Models | Model | Size | Expected Speed | Reasoning | |-------------------------|-------|----------------|----------------------------------------| | `qwen2.5-coder:14b` | 14B | ~35-40 tok/s | Strong coding performance at moderate size | | `deepseek-coder-v2:16b`| 16B | ~30-38 tok/s | Competitive coding with broad language support | | `llama3.1:8b` | 8B | ~50-55 tok/s | Fast general-purpose model | | `mistral:7b` | 7B | ~50-58 tok/s | Fast general-purpose, good reasoning | ## Output Files ### Benchmark Report Written to `benchmarks/benchmark_.md`: ``` | Model | Coding Composite | General Composite | Classification | Tokens/sec | TTFT (ms) | |------------------------|------------------|-------------------|----------------|------------|-----------| | qwen2.5-coder:14b | 0.82 | 0.65 | coding | 38.2 | 420 | | ... | ... | ... | ... | ... | ... | ``` ### Model Selection Written to `model_selection.json`: ```json { "timestamp": "2025-01-15T10:30:00Z", "slot1_coding": "qwen2.5-coder:14b", "slot2_general": "llama3.1:8b", "slot3_backup": "deepseek-coder-v2:16b", "slot4_experimental": null, "results": { "qwen2.5-coder:14b": { "coding_composite": 0.82, "general_composite": 0.65, "classification": "coding", "tokens_per_sec": 38.2, "ttft_ms": 420 } } } ``` ## Tags ```bash ansible-playbook playbooks/site.yml --tags benchmark ```