LLM Benchmark V4
A modular, SQLite-backed benchmark for evaluating local LLMs running on Ollama. It is designed to measure operational reliability in agentic and automated pipelines, not general intelligence. It rewards format obedience, structured output correctness, tool call precision, and hallucination resistance, and it intentionally penalises verbosity and creative deviation.
Philosophy
Most public benchmarks measure what a model knows. This one measures whether it can be trusted in production:
- Does it follow exact format instructions?
- Does it call tools correctly without adding noise?
- Does it refuse to fabricate facts?
- Is it consistent across multiple runs?
A model scoring 9, 9, 2, 8, 1 is worse for agents than one scoring 7, 7, 7, 7, 7.
Test Suite
16 tests across 6 categories, weighted by production relevance.
Agent / Tool Reliability — 25%
| Test | What it measures |
|---|---|
| tool_calling | Returns a single valid function call with no extra text |
| multi_step_agent | Chains 3 tool calls in sequence and produces a final answer |
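
A deterministic check for the tool_calling test could look roughly like the sketch below. The JSON shape (`name` plus `arguments`), the function name `validate_tool_call`, and the all-or-nothing 0/10 scoring are assumptions for illustration; the benchmark's real validator and call format may differ.

```python
import json


def validate_tool_call(raw_output, expected_name, required_args):
    """Return 10 for a single clean function call, 0 otherwise (hypothetical scoring)."""
    text = raw_output.strip()
    try:
        # json.loads rejects any prose before or after the JSON object,
        # which enforces the "no extra text" requirement.
        call = json.loads(text)
    except json.JSONDecodeError:
        return 0
    if not isinstance(call, dict):
        return 0
    if call.get("name") != expected_name:
        return 0
    args = call.get("arguments")
    if not isinstance(args, dict):
        return 0
    # Every required argument must be present in the call.
    if not set(required_args).issubset(args):
        return 0
    return 10
```

A reply such as `Sure! {"name": ...}` scores 0 here because the leading prose makes the whole output unparseable, which matches the intent of penalising noise around the call.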
Coding / Infrastructure — 25%
| Test | What it measures |
|---|---|
| coding | Produces a working LIS function with correct time complexity |
| yaml_generation | Returns valid parseable Kubernetes Deployment YAML |
| artifact_mermaid | Returns a valid Mermaid flowchart with all 8 pipeline stages |
| json_schema | Returns a valid JSON Schema with required fields and constraints |
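
For orientation, here is the kind of answer the coding test is looking for. The prompt itself is not reproduced here, so this sketch assumes that LIS means the longest strictly increasing subsequence and that "correct time complexity" means O(n log n); `lis_length` is a hypothetical name.

```python
from bisect import bisect_left


def lis_length(nums):
    """Length of the longest strictly increasing subsequence in O(n log n)."""
    # tails[k] holds the smallest possible tail value of an increasing
    # subsequence of length k + 1 seen so far.
    tails = []
    for x in nums:
        i = bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)   # x extends the longest subsequence found so far
        else:
            tails[i] = x      # x gives a smaller tail for length i + 1
    return len(tails)
```

A quadratic dynamic-programming solution returns the same lengths but would presumably miss the complexity requirement.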
RAG / Context Fidelity — 20%
| Test | What it measures |
|---|---|
| rag | Summarises a provided document accurately without invention |
| context_begin | Retrieves a fact from the beginning of a document |
| context_middle | Retrieves a fact from the middle of a document |
| context_end | Retrieves a fact from the end of a document |
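
The three context tests differ only in where the target fact sits. One way to build such documents is sketched below; `build_document`, the paragraph-based layout, and the position names are assumptions, not the benchmark's actual prompt construction.

```python
def build_document(filler_paragraphs, fact, position):
    """Plant `fact` at the beginning, middle, or end of the filler text."""
    paragraphs = list(filler_paragraphs)  # copy so the caller's list is untouched
    index = {
        "begin": 0,
        "middle": len(paragraphs) // 2,
        "end": len(paragraphs),
    }[position]
    paragraphs.insert(index, fact)
    return "\n\n".join(paragraphs)
```

Using the same filler for all three positions keeps the tests comparable, so a score difference points at position sensitivity and not at document content.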
Structured Outputs — 15%
| Test | What it measures |
|---|---|
| structured | Returns nested JSON with typed fields (recommendations array) |
| compression | Compresses content into exactly 10 bullet points preserving all industries |
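
The compression requirement is easy to check deterministically. The sketch below counts bullet lines and looks for each industry name; the accepted bullet markers, the case-insensitive substring match, and the name `check_compression` are assumptions.

```python
import re

# A bullet is a line starting with -, *, or a bullet dot, or a number such as "1." or "1)".
BULLET = re.compile(r"^\s*(?:[-*\u2022]|\d+[.)])\s+\S")


def check_compression(output, industries, expected_bullets=10):
    """Return (ok, missing): ok is True only with the exact bullet count and no missing industry."""
    bullets = [line for line in output.splitlines() if BULLET.match(line)]
    lowered = output.lower()
    missing = [name for name in industries if name.lower() not in lowered]
    ok = len(bullets) == expected_bullets and not missing
    return ok, missing
```

Returning the list of missing industries alongside the verdict makes it possible to record why a run failed, for example in a notes column.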
Hallucination Resistance — 10%
| Test | What it measures |
|---|---|
| hallucination | Refuses to describe a non-existent book; rewards uncertainty, penalises invention |
Pure Reasoning — 5%
| Test | What it measures |
|---|---|
| reasoning | Solves a multi-step percentage problem correctly |
| math | Solves a rate problem requiring correct reasoning about independence |
| agent | Plans a search strategy meeting 5 explicit requirements |
Scoring Architecture
Raw output
↓
normalize_text(): strip ANSI, thinking tokens, Ollama stats
↓
Layer 1: Deterministic Validator
    0 or 10 → skip judge (definitive)
    1–9 → blend with judge (80% validator / 20% judge)
↓
Layer 2: Semantic Judge (only when needed)
    qwen2.5:14b with strict rubric (never benchmarked)
↓
Layer 3: Embedding Similarity (RAG test only)
    nomic-embed-text via Ollama
↓
format_score (separate)
    ANSI codes, word limit, markdown obedience
↓
combined = semantic × 0.8 + format × 0.2
weighted_avg = Σ(semantic × test_weight)
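
The pipeline above can be sketched in a few lines of Python. This is an illustration under assumptions, not the code in scoring.py: the `<think>...</think>` tag format, the omission of Ollama stats stripping and of the embedding layer, and the helper names `semantic_score` and `combined_score` are all hypothetical.

```python
import re

ANSI = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")             # terminal colour/cursor codes
THINK = re.compile(r"<think>.*?</think>", re.DOTALL)     # assumed thinking-token format


def normalize_text(raw):
    """Strip ANSI escape codes and thinking blocks before scoring."""
    return THINK.sub("", ANSI.sub("", raw)).strip()


def semantic_score(validator, judge=None):
    """Layer 1 and 2: definitive validator scores skip the judge, others are blended."""
    if validator in (0, 10) or judge is None:
        return float(validator)
    return 0.8 * validator + 0.2 * judge


def combined_score(semantic, fmt):
    """Final per-test score: combined = semantic x 0.8 + format x 0.2."""
    return 0.8 * semantic + 0.2 * fmt
```

With a validator score of 5 and a judge score of 10, the blended semantic score is 6.0; adding a perfect format score of 10 gives a combined score of 6.8.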
What the Numbers Mean
| Metric | Description |
|---|---|
| w | Weighted semantic average (primary score) |
| σ | Standard deviation across tests (lower is more reliable) |
| fail% | Percentage of tests scoring ≤ 2/10 (hard failures) |
| tok/s | Generation speed on this hardware |
| 🌡 | Average GPU temperature during benchmark |
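
As a rough illustration of how w, σ, and fail% relate to per-test scores, consider the sketch below. The category weights come from the test suite above and the short category names from the report output; splitting each category weight evenly across its tests and using the population standard deviation are assumptions, as is the name `summarize`.

```python
from statistics import pstdev

CATEGORY_WEIGHTS = {
    "agent": 0.25, "code": 0.25, "rag": 0.20,
    "struct": 0.15, "hall": 0.10, "reason": 0.05,
}


def summarize(scores_by_category):
    """Return (w, sigma, fail_pct) from a mapping of category -> list of semantic scores."""
    # Each test's weight is its category weight divided by the number of
    # tests in that category (assumed), so the weights sum to 1.
    w = sum(
        score * CATEGORY_WEIGHTS[cat] / len(scores)
        for cat, scores in scores_by_category.items()
        for score in scores
    )
    all_scores = [s for scores in scores_by_category.values() for s in scores]
    sigma = pstdev(all_scores)
    # Hard failures are tests scoring 2/10 or lower.
    fail_pct = 100 * sum(1 for s in all_scores if s <= 2) / len(all_scores)
    return w, sigma, fail_pct
```

With 16 tests, a single hard failure yields fail% = 6.25, which is consistent with the 6% figures shown in reports.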
Compliance rates track pass rate (score ≥ 8) for:
- JSON — nested structured output
- YAML — Kubernetes manifest generation
- Tool — function call format
- Hallucination — refusal of fabricated content
Requirements
pip install pyyaml rapidfuzz requests
# Ollama must be running with the judge model, the embedding model, and the models under test available
Usage
# Run all baseline models
python3 main.py
# Single model (auto-detects thinking mode)
python3 main.py --model granite4.1:8b
# Variance analysis — 3 runs per model
python3 main.py --mode baseline --runs 3
# Auto-discover and test all models in ollama list
python3 main.py --test-all
# Reports
python3 main.py --report # latest run per model
python3 main.py --report --report-best # best run per model
# Fast run (no thermal cooldown)
python3 main.py --no-cooldown
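
For readers extending the CLI, the flags above map naturally onto argparse. The sketch below is a guess at the shape of the parser, not the contents of main.py: the defaults, help strings, and the name `build_parser` are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser covering the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--model", help="benchmark a single model")
    parser.add_argument("--mode", help="benchmark mode, e.g. baseline")
    parser.add_argument("--runs", type=int, default=1, help="runs per model (default assumed)")
    parser.add_argument("--test-all", action="store_true", help="test every model in ollama list")
    parser.add_argument("--report", action="store_true", help="print a report instead of running")
    parser.add_argument("--report-best", action="store_true", help="report the best run per model")
    parser.add_argument("--no-cooldown", action="store_true", help="skip thermal cooldown")
    return parser
```

Calling `build_parser().parse_args(["--mode", "baseline", "--runs", "3"])` yields a namespace with `mode == "baseline"` and `runs == 3`.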
Configuration
Edit config.py:
MODELS_BASELINE_DIRECT = ["granite4.1:8b", "qwen2.5-coder:14b"]
MODELS_BASELINE_THINKING = ["nemotron-3-nano:4b", "gemma4:e4b"]
JUDGE_MODEL = "qwen2.5:14b" # dedicated — never benchmarked
EMBED_MODEL = "nomic-embed-text"
File Structure
benchmark_v4/
    config.py       models, weights, settings
    prompts.py      all prompts, ground truths, judge rubrics
    validators.py   Layer 1: deterministic scoring
    judge.py        Layer 2: LLM judge + embedding similarity
    scoring.py      combines all layers into final scores
    runner.py       executes models, orchestrates benchmark
    storage.py      SQLite read/write (benchmark_v4.db)
    reporting.py    terminal output
    main.py         CLI entry point
Results Database
All results are stored in benchmark_v4.db (SQLite) and are never deleted.
-- Latest ranking
SELECT model, weighted_avg, stdev_all, failure_rate_pct
FROM runs
WHERE id IN (SELECT MAX(id) FROM runs GROUP BY model)
ORDER BY weighted_avg DESC;
-- Compliance rates
SELECT model, compliance_json, compliance_yaml,
compliance_tool, compliance_hall
FROM runs
WHERE id IN (SELECT MAX(id) FROM runs GROUP BY model);
-- Detailed test scores
SELECT test, semantic_score, format_score, notes
FROM details
WHERE model = 'granite4.1:8b'
AND run_id = (SELECT MAX(id) FROM runs WHERE model = 'granite4.1:8b');
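
The same queries can be run from Python with the standard sqlite3 module. This sketch reuses the latest-ranking query as written above; the helper name `latest_ranking` is hypothetical, and the column names are taken from the queries, not from a verified schema.

```python
import sqlite3

LATEST_RANKING = """
SELECT model, weighted_avg, stdev_all, failure_rate_pct
FROM runs
WHERE id IN (SELECT MAX(id) FROM runs GROUP BY model)
ORDER BY weighted_avg DESC
"""


def latest_ranking(conn):
    """Return (model, weighted_avg, stdev_all, failure_rate_pct) rows, best model first."""
    return conn.execute(LATEST_RANKING).fetchall()
```

Typical use would be `conn = sqlite3.connect("benchmark_v4.db")` followed by `latest_ranking(conn)`. Because `MAX(id)` picks the newest row per model, older runs stay in the database but do not affect the ranking.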
Validated Stack (RTX 5060 Ti 16GB)
| Model | Role | w | σ | fail% |
|---|---|---|---|---|
| granite4.1:8b | Reliable default | 6.85 | 0.81 | 0% |
| qwen2.5-coder:14b | Coding / infra | 6.69 | 1.15 | 0% |
| nemotron-3-nano:4b | Fast chat | 6.37 | 2.87 | 6% |
| gemma4:e4b | RAG / research | 6.06 | 2.56 | 6% |
10 models tested. 6 rejected. Rankings stable across rebuilds.
Example Output by Category
======================================================================
  CATEGORY BREAKDOWN (latest run per model)
======================================================================
  Model                  agent    code     rag  struct    hall  reason
----------------------------------------------------------------------
★ gemma4:e4b               8.5    10.0     9.0    10.0    10.0     7.0
★ granite4.1:8b           10.0    10.0     9.0     7.5    10.0    7.67
  phi4:latest             10.0    10.0     9.0     6.5    10.0     7.0
★ nemotron-3-nano:4b       7.0     7.5     9.5    10.0    10.0    8.33
  lfm2:latest             10.0     7.5     9.0    10.0     4.0     8.0
★ qwen2.5-coder:14b       10.0    10.0     8.5     9.5     0.0    6.67
  mistral-nemo:12b         5.0    10.0     8.5     9.5     6.0     5.0