# LLM Benchmark V4

A modular, SQLite-backed benchmark for evaluating local LLMs running on Ollama.

Designed for **operational reliability in agentic and automated pipelines**, not general intelligence. It rewards format obedience, structured output correctness, tool call precision, and hallucination resistance. It intentionally penalises verbosity and creative deviation.

---

## Philosophy

Most public benchmarks measure what a model knows. This one measures whether it can be trusted in production:

- Does it follow exact format instructions?
- Does it call tools correctly without adding noise?
- Does it refuse to fabricate facts?
- Is it consistent across multiple runs?

A model scoring `9, 9, 2, 8, 1` is worse for agents than one scoring `7, 7, 7, 7, 7`.

---

## Test Suite

16 tests across 6 categories, weighted by production relevance.

### Agent / Tool Reliability (25%)

| Test | What it measures |
|---|---|
| `tool_calling` | Returns a single valid function call with no extra text |
| `multi_step_agent` | Chains 3 tool calls in sequence and produces a final answer |

### Coding / Infrastructure (25%)

| Test | What it measures |
|---|---|
| `coding` | Produces a working LIS function with correct time complexity |
| `yaml_generation` | Returns valid, parseable Kubernetes Deployment YAML |
| `artifact_mermaid` | Returns a valid Mermaid flowchart with all 8 pipeline stages |
| `json_schema` | Returns a valid JSON Schema with required fields and constraints |

### RAG / Context Fidelity (20%)

| Test | What it measures |
|---|---|
| `rag` | Summarises a provided document accurately without invention |
| `context_begin` | Retrieves a fact from the beginning of a document |
| `context_middle` | Retrieves a fact from the middle of a document |
| `context_end` | Retrieves a fact from the end of a document |

### Structured Outputs (15%)

| Test | What it measures |
|---|---|
| `structured` | Returns nested JSON with typed fields (recommendations array) |
| `compression` | Compresses content into exactly 10 bullet points preserving all industries |

### Hallucination Resistance (10%)

| Test | What it measures |
|---|---|
| `hallucination` | Refuses to describe a non-existent book; rewards uncertainty, penalises invention |

### Pure Reasoning (5%)

| Test | What it measures |
|---|---|
| `reasoning` | Solves a multi-step percentage problem correctly |
| `math` | Solves a rate problem requiring correct reasoning about independence |
| `agent` | Plans a search strategy meeting 5 explicit requirements |

---

## Scoring Architecture

```
Raw output
    ↓
normalize_text()          strip ANSI, thinking tokens, Ollama stats
    ↓
Layer 1: Deterministic Validator
    0 or 10  → skip judge (definitive)
    1–9      → blend with judge (80% validator / 20% judge)
    ↓
Layer 2: Semantic Judge (only when needed)
    qwen2.5:14b with strict rubric, never benchmarked
    ↓
Layer 3: Embedding Similarity (RAG test only)
    nomic-embed-text via Ollama
    ↓
format_score (separate)
    ANSI codes, word limit, markdown obedience
    ↓
combined     = semantic × 0.8 + format × 0.2
weighted_avg = Σ(semantic × test_weight)
```

---

## What the Numbers Mean

| Metric | Description |
|---|---|
| `w` | Weighted semantic average (primary score) |
| `σ` | Standard deviation across tests (lower is more reliable) |
| `fail%` | Percentage of tests scoring ≤ 2/10 (hard failures) |
| `tok/s` | Generation speed on this hardware |
| `🌡` | Average GPU temperature during benchmark |

**Compliance rates** track pass rate (score ≥ 8) for:

- JSON: nested structured output
- YAML: Kubernetes manifest generation
- Tool: function call format
- Hallucination: refusal of fabricated content

---

## Requirements

```bash
pip install pyyaml rapidfuzz requests
# Ollama running with: judge model, embed model, and models under test
```

---

## Usage

```bash
# Run all baseline models
python3 main.py

# Single model (auto-detects thinking mode)
python3 main.py --model granite4.1:8b

# Variance analysis: 3 runs per model
python3 main.py --mode baseline --runs 3

# Auto-discover and test all models in ollama list
python3 main.py --test-all

# Reports
python3 main.py --report                 # latest run per model
python3 main.py --report --report-best   # best run per model

# Fast run (no thermal cooldown)
python3 main.py --no-cooldown
```

---

## Configuration

Edit `config.py`:

```python
MODELS_BASELINE_DIRECT   = ["granite4.1:8b", "qwen2.5-coder:14b"]
MODELS_BASELINE_THINKING = ["nemotron-3-nano:4b", "gemma4:e4b"]
JUDGE_MODEL = "qwen2.5:14b"       # dedicated, never benchmarked
EMBED_MODEL = "nomic-embed-text"
```

---

## File Structure

```
benchmark_v4/
    config.py       models, weights, settings
    prompts.py      all prompts, ground truths, judge rubrics
    validators.py   Layer 1: deterministic scoring
    judge.py        Layer 2: LLM judge + embedding similarity
    scoring.py      combines all layers into final scores
    runner.py       executes models, orchestrates benchmark
    storage.py      SQLite read/write (benchmark_v4.db)
    reporting.py    terminal output
    main.py         CLI entry point
```

---

## Results Database

All results are stored in `benchmark_v4.db` (SQLite, never deleted).

```sql
-- Latest ranking
SELECT model, weighted_avg, stdev_all, failure_rate_pct
FROM runs
WHERE id IN (SELECT MAX(id) FROM runs GROUP BY model)
ORDER BY weighted_avg DESC;

-- Compliance rates
SELECT model, compliance_json, compliance_yaml, compliance_tool, compliance_hall
FROM runs
WHERE id IN (SELECT MAX(id) FROM runs GROUP BY model);

-- Detailed test scores
SELECT test, semantic_score, format_score, notes
FROM details
WHERE model = 'granite4.1:8b'
  AND run_id = (SELECT MAX(id) FROM runs WHERE model = 'granite4.1:8b');
```

---

## Validated Stack (RTX 5060 Ti 16GB)

| Model | Role | w | σ | fail% |
|---|---|---|---|---|
| granite4.1:8b | Reliable default | 6.85 | 0.81 | 0% |
| qwen2.5-coder:14b | Coding / infra | 6.69 | 1.15 | 0% |
| nemotron-3-nano:4b | Fast chat | 6.37 | 2.87 | 6% |
| gemma4:e4b | RAG / research | 6.06 | 2.56 | 6% |

10 models tested. 6 rejected. Rankings stable across rebuilds.
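
To make the rules under Scoring Architecture and What the Numbers Mean concrete, here is a minimal sketch of how a per-test score and the `w`, `σ`, and `fail%` columns could be derived. It is not the code in `scoring.py`: the function names are hypothetical, the equal test weights are placeholders, and the use of population standard deviation is an assumption, since the README does not say which variant is used.

```python
from statistics import pstdev


def semantic_score(validator, judge=None):
    """Layer 1 + Layer 2: a validator score of 0 or 10 is definitive and skips the judge."""
    if validator in (0, 10):
        return float(validator)
    if judge is None:
        raise ValueError("a judge score is needed when the validator returns 1-9")
    return 0.8 * validator + 0.2 * judge  # 80% validator / 20% judge


def combined_score(semantic, format_score):
    """combined = semantic x 0.8 + format x 0.2"""
    return 0.8 * semantic + 0.2 * format_score


def summarize(scores, weights):
    """scores and weights are dicts keyed by test name; weights are assumed to sum to 1."""
    values = list(scores.values())
    return {
        "w": sum(scores[t] * weights[t] for t in scores),    # weighted semantic average
        "sigma": pstdev(values),                             # assumption: population stdev
        "fail_pct": 100.0 * sum(v <= 2 for v in values) / len(values),
    }


# The two score profiles from the Philosophy section, with hypothetical equal weights.
tests = "abcde"
equal_weights = {t: 0.2 for t in tests}
spiky = dict(zip(tests, [9, 9, 2, 8, 1]))
steady = dict(zip(tests, [7, 7, 7, 7, 7]))

for name, scores in (("spiky", spiky), ("steady", steady)):
    m = summarize(scores, equal_weights)
    print(f"{name}: w={m['w']:.2f} sigma={m['sigma']:.2f} fail%={m['fail_pct']:.0f}")
# spiky: w=5.80 sigma=3.54 fail%=40
# steady: w=7.00 sigma=0.00 fail%=0
```

Under these assumptions the spiky profile loses on all three metrics, which matches the Philosophy section's claim that consistency matters more than peak scores.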
## Output Example by Category

```
======================================================================
  CATEGORY BREAKDOWN  (latest run per model)
======================================================================
  Model                  agent    code     rag  struct    hall  reason
----------------------------------------------------------------------
★ gemma4:e4b               8.5    10.0     9.0    10.0    10.0     7.0
★ granite4.1:8b           10.0    10.0     9.0     7.5    10.0    7.67
  phi4:latest             10.0    10.0     9.0     6.5    10.0     7.0
★ nemotron-3-nano:4b       7.0     7.5     9.5    10.0    10.0    8.33
  lfm2:latest             10.0     7.5     9.0    10.0     4.0     8.0
★ qwen2.5-coder:14b       10.0    10.0     8.5     9.5     0.0    6.67
  mistral-nemo:12b         5.0    10.0     8.5     9.5     6.0     5.0
```
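
The category columns above could plausibly be produced by averaging the per-test semantic scores within each category. The sketch below illustrates that roll-up. The test-to-category mapping follows the Test Suite section, but the plain mean, the `category_breakdown` helper, and the sample scores are all assumptions for illustration, not the logic in `reporting.py` and not real results.

```python
from collections import defaultdict
from statistics import mean

# Test name -> column label, following the Test Suite tables.
CATEGORY_OF = {
    "tool_calling": "agent", "multi_step_agent": "agent",
    "coding": "code", "yaml_generation": "code",
    "artifact_mermaid": "code", "json_schema": "code",
    "rag": "rag", "context_begin": "rag",
    "context_middle": "rag", "context_end": "rag",
    "structured": "struct", "compression": "struct",
    "hallucination": "hall",
    "reasoning": "reason", "math": "reason", "agent": "reason",
}


def category_breakdown(test_scores):
    """Group per-test scores by category and average them (assumed: unweighted mean)."""
    buckets = defaultdict(list)
    for test, score in test_scores.items():
        buckets[CATEGORY_OF[test]].append(score)
    return {cat: round(mean(vals), 2) for cat, vals in buckets.items()}


# Hypothetical per-test scores, chosen only to illustrate the arithmetic.
sample = {
    "tool_calling": 10, "multi_step_agent": 7,
    "reasoning": 8, "math": 8, "agent": 7,
}
print(category_breakdown(sample))
# {'agent': 8.5, 'reason': 7.67}
```

Values such as 7.67 and 8.33 in the `reason` column are consistent with a mean over three tests, though the README does not confirm the exact aggregation.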