A modular, SQLite-backed benchmark for evaluating local LLMs running on Ollama. It is designed to measure operational reliability in agentic and automated pipelines, not general intelligence. It rewards format obedience, structured output correctness, tool call precision, and hallucination resistance, and it intentionally penalises verbosity and creative deviation.

Philosophy

Most public benchmarks measure what a model knows. This one measures whether it can be trusted in production:

Does it follow exact format instructions?
Does it call tools correctly without adding noise?
Does it refuse to fabricate facts?
Is it consistent across multiple runs?

A model scoring 9, 9, 2, 8, 1 is worse for agents than one scoring 7, 7, 7, 7, 7.
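
To make the comparison concrete, the two score lists above can be summarised by their mean, spread, and worst case. This is a minimal sketch rather than the benchmark's own code: `summarise` is a hypothetical helper, and the use of population standard deviation is an assumption, since the source does not say which variant it reports.

```python
from statistics import mean, pstdev

# Scores copied from the sentence above.
spiky = [9, 9, 2, 8, 1]
steady = [7, 7, 7, 7, 7]

def summarise(scores):
    """Return (mean, population standard deviation, worst score)."""
    return mean(scores), pstdev(scores), min(scores)

for name, scores in (("spiky", spiky), ("steady", steady)):
    avg, spread, worst = summarise(scores)
    print(f"{name}: mean={avg:.2f} stdev={spread:.2f} worst={worst}")
```

The spiky model has both the lower mean and the much larger spread, and its worst score is the one an agent pipeline will eventually hit.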

Test Suite

16 tests across 6 categories, weighted by production relevance.

Agent / Tool Reliability — 25%

Test	What it measures
`tool_calling`	Returns a single valid function call with no extra text
`multi_step_agent`	Chains 3 tool calls in sequence and produces a final answer

Coding / Infrastructure — 25%

Test	What it measures
`coding`	Produces a working LIS function with correct time complexity
`yaml_generation`	Returns valid parseable Kubernetes Deployment YAML
`artifact_mermaid`	Returns a valid Mermaid flowchart with all 8 pipeline stages
`json_schema`	Returns a valid JSON Schema with required fields and constraints

RAG / Context Fidelity — 20%

Test	What it measures
`rag`	Summarises a provided document accurately without invention
`context_begin`	Retrieves a fact from the beginning of a document
`context_middle`	Retrieves a fact from the middle of a document
`context_end`	Retrieves a fact from the end of a document

Structured Outputs — 15%

Test	What it measures
`structured`	Returns nested JSON with typed fields (recommendations array)
`compression`	Compresses content into exactly 10 bullet points preserving all industries

Hallucination Resistance — 10%

Test	What it measures
`hallucination`	Refuses to describe a non-existent book — rewards uncertainty, penalises invention

Pure Reasoning — 5%

Test	What it measures
`reasoning`	Solves a multi-step percentage problem correctly
`math`	Solves a rate problem requiring correct reasoning about independence
`agent`	Plans a search strategy meeting 5 explicit requirements
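
The suite layout above can be written down as data and sanity-checked. The sketch below is illustrative only: the `SUITE` name and the short category keys are hypothetical, and splitting each category's weight evenly across its tests is an assumption, because the source gives weights per category but not per test.

```python
# Category weights (percent) and test names as listed above.
SUITE = {
    "agent":  (25, ["tool_calling", "multi_step_agent"]),
    "code":   (25, ["coding", "yaml_generation", "artifact_mermaid", "json_schema"]),
    "rag":    (20, ["rag", "context_begin", "context_middle", "context_end"]),
    "struct": (15, ["structured", "compression"]),
    "hall":   (10, ["hallucination"]),
    "reason": (5,  ["reasoning", "math", "agent"]),
}

def per_test_weights(suite):
    """ASSUMPTION: a category's weight is split evenly across its tests."""
    weights = {}
    for pct, tests in suite.values():
        for test in tests:
            weights[test] = pct / 100 / len(tests)
    return weights

weights = per_test_weights(SUITE)
assert len(weights) == 16                        # 16 tests in total
assert abs(sum(weights.values()) - 1.0) < 1e-9   # category weights sum to 100%
```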

Scoring Architecture

Raw output
  ↓
normalize_text()        strip ANSI, thinking tokens, Ollama stats
  ↓
Layer 1: Deterministic Validator
  0 or 10 → skip judge (definitive)
  1–9     → blend with judge (80% validator / 20% judge)
  ↓
Layer 2: Semantic Judge (only when needed)
  qwen2.5:14b with strict rubric — never benchmarked
  ↓
Layer 3: Embedding Similarity (RAG test only)
  nomic-embed-text via Ollama
  ↓
format_score (separate)
  ANSI codes, word limit, markdown obedience
  ↓
combined = semantic × 0.8 + format × 0.2
weighted_avg = Σ(semantic × test_weight)
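
The first steps of the pipeline above can be sketched in a few lines. This is not the project's implementation: the regular expressions assume `<think>...</think>` markup for thinking tokens and the timing lines printed by `ollama run --verbose`, and `judge_fn` is a hypothetical callback standing in for the Layer 2 judge. The embedding layer is left out.

```python
import re

ANSI_RE = re.compile(r"\x1b\[[0-9;?]*[A-Za-z]")          # terminal colour/cursor codes
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)  # assumed thinking-token markup
# Assumed shape of the stats lines; adjust to whatever the runner actually captures.
STATS_RE = re.compile(
    r"^(total duration|load duration|prompt eval (count|duration|rate)|"
    r"eval (count|duration|rate)):.*$\n?",
    re.MULTILINE,
)

def normalize_text(raw: str) -> str:
    """Strip ANSI codes, thinking blocks and stats lines (hypothetical version)."""
    text = ANSI_RE.sub("", raw)
    text = THINK_RE.sub("", text)
    text = STATS_RE.sub("", text)
    return text.strip()

def semantic_score(validator: float, judge_fn) -> float:
    """Layer 1 decides alone at 0 or 10; otherwise blend 80/20 with the judge."""
    if validator in (0, 10):
        return float(validator)           # definitive: the judge is never called
    return 0.8 * validator + 0.2 * judge_fn()

def combined_score(semantic: float, fmt: float) -> float:
    """combined = semantic × 0.8 + format × 0.2, as in the diagram."""
    return 0.8 * semantic + 0.2 * fmt
```

Passing the judge as a callable keeps the expensive LLM call lazy, which mirrors the "only when needed" note on Layer 2.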

What the Numbers Mean

Metric	Description
`w`	Weighted semantic average — primary score
`σ`	Standard deviation across tests — lower is more reliable
`fail%`	Percentage of tests scoring ≤ 2/10 — hard failures
`tok/s`	Generation speed on this hardware
`🌡`	Average GPU temperature during benchmark

Compliance rates track the pass rate (score ≥ 8) for:

JSON — nested structured output
YAML — Kubernetes manifest generation
Tool — function call format
Hallucination — refusal of fabricated content
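
The two thresholds above (hard failure at ≤ 2, pass at ≥ 8) translate directly into code. The function names below are hypothetical and the score list is made up for illustration; how the benchmark aggregates these rates across runs is not stated in the source.

```python
def failure_rate_pct(scores):
    """Percentage of tests scoring <= 2 out of 10 (hard failures)."""
    return 100 * sum(s <= 2 for s in scores) / len(scores)

def compliance_rate_pct(scores, threshold=8):
    """Percentage of scores that pass, where a pass is score >= threshold."""
    return 100 * sum(s >= threshold for s in scores) / len(scores)

# Made-up scores for one model across 16 tests, for illustration only.
scores = [10, 10, 9, 8, 7.5, 10, 9, 9, 10, 8, 10, 6, 1, 7, 8, 9]
print(f"fail% = {failure_rate_pct(scores):.2f}")
```

With 16 tests, a single hard failure corresponds to 6.25%.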

Requirements

pip install pyyaml rapidfuzz requests
# Ollama must be running, with the judge model, the embed model, and the models under test available
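
Before a long run it can help to confirm that the required models are actually installed. The sketch below queries Ollama's `GET /api/tags` endpoint on the default port using only the standard library; `fetch_installed` and `missing_models` are hypothetical helpers and not part of this project.

```python
import json
from urllib.request import urlopen

def fetch_installed(base_url="http://localhost:11434"):
    """Return model names reported by Ollama's GET /api/tags endpoint."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        payload = json.load(resp)
    return [m["name"] for m in payload.get("models", [])]

def missing_models(required, installed):
    """List required models that are not installed; a bare name means ':latest'."""
    def canon(name):
        return name if ":" in name else f"{name}:latest"
    have = {canon(n) for n in installed}
    return [r for r in required if canon(r) not in have]

# Usage (needs a running Ollama server):
#   print(missing_models(["qwen2.5:14b", "nomic-embed-text"], fetch_installed()))
```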

Usage

# Run all baseline models
python3 main.py

# Single model (auto-detects thinking mode)
python3 main.py --model granite4.1:8b

# Variance analysis — 3 runs per model
python3 main.py --mode baseline --runs 3

# Auto-discover and test all models in ollama list
python3 main.py --test-all

# Reports
python3 main.py --report               # latest run per model
python3 main.py --report --report-best # best run per model

# Fast run (no thermal cooldown)
python3 main.py --no-cooldown

Configuration

Edit config.py:

MODELS_BASELINE_DIRECT   = ["granite4.1:8b", "qwen2.5-coder:14b"]
MODELS_BASELINE_THINKING = ["nemotron-3-nano:4b", "gemma4:e4b"]
JUDGE_MODEL              = "qwen2.5:14b"   # dedicated — never benchmarked
EMBED_MODEL              = "nomic-embed-text"

File Structure

benchmark_v4/
  config.py      models, weights, settings
  prompts.py     all prompts, ground truths, judge rubrics
  validators.py  Layer 1: deterministic scoring
  judge.py       Layer 2: LLM judge + embedding similarity
  scoring.py     combines all layers into final scores
  runner.py      executes models, orchestrates benchmark
  storage.py     SQLite read/write (benchmark_v4.db)
  reporting.py   terminal output
  main.py        CLI entry point

Results Database

All results are stored in benchmark_v4.db (SQLite) and are never deleted.

-- Latest ranking
SELECT model, weighted_avg, stdev_all, failure_rate_pct
FROM runs
WHERE id IN (SELECT MAX(id) FROM runs GROUP BY model)
ORDER BY weighted_avg DESC;

-- Compliance rates
SELECT model, compliance_json, compliance_yaml,
       compliance_tool, compliance_hall
FROM runs
WHERE id IN (SELECT MAX(id) FROM runs GROUP BY model);

-- Detailed test scores
SELECT test, semantic_score, format_score, notes
FROM details
WHERE model = 'granite4.1:8b'
AND run_id = (SELECT MAX(id) FROM runs WHERE model = 'granite4.1:8b');
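
The same queries can be run from Python with the standard `sqlite3` module. This is a small sketch that reuses the latest-ranking query verbatim; `latest_ranking` and `open_readonly` are hypothetical helpers, and opening the file read-only is a suggestion for reporting scripts rather than something the project is known to do.

```python
import sqlite3

LATEST_RANKING = """
SELECT model, weighted_avg, stdev_all, failure_rate_pct
FROM runs
WHERE id IN (SELECT MAX(id) FROM runs GROUP BY model)
ORDER BY weighted_avg DESC
"""

def latest_ranking(conn):
    """Return one row per model (its most recent run), best score first."""
    return conn.execute(LATEST_RANKING).fetchall()

def open_readonly(path="benchmark_v4.db"):
    """Open the results database read-only so a report cannot modify it."""
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)

# Usage:
#   for row in latest_ranking(open_readonly()):
#       print(row)
```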

Validated Stack (RTX 5060 Ti 16GB)

Model	Role	w	σ	fail%
granite4.1:8b	Reliable default	6.85	0.81	0%
qwen2.5-coder:14b	Coding / infra	6.69	1.15	0%
nemotron-3-nano:4b	Fast chat	6.37	2.87	6%
gemma4:e4b	RAG / research	6.06	2.56	6%

10 models were tested and 6 were rejected. Rankings stayed stable across rebuilds.

Example Output by Category

========================================================================
  CATEGORY BREAKDOWN (latest run per model)
========================================================================

  Model                    agent    code     rag  struct    hall  reason
  ----------------------------------------------------------------------
  ★ gemma4:e4b               8.5    10.0     9.0    10.0    10.0     7.0
  ★ granite4.1:8b           10.0    10.0     9.0     7.5    10.0    7.67
    phi4:latest             10.0    10.0     9.0     6.5    10.0     7.0
  ★ nemotron-3-nano:4b       7.0     7.5     9.5    10.0    10.0    8.33
    lfm2:latest             10.0     7.5     9.0    10.0     4.0     8.0
  ★ qwen2.5-coder:14b       10.0    10.0     8.5     9.5     0.0    6.67
    mistral-nemo:12b         5.0    10.0     8.5     9.5     6.0     5.0

LLM Benchmark V4