commit 51e9389726f65f2f9873302b31dd047040f46e75
Author: rgcosta
Date:   Thu May 14 14:06:00 2026 +0000

    Add README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..082b6ea
--- /dev/null
+++ b/README.md
@@ -0,0 +1,237 @@
+# LLM Benchmark V4
+
+A modular, SQLite-backed benchmark for evaluating local LLMs running on Ollama.
+Designed for **operational reliability in agentic and automated pipelines** — not general intelligence.
+It rewards format obedience, structured output correctness, tool call precision, and hallucination resistance.
+It intentionally penalises verbosity and creative deviation.
+
+---
+
+## Philosophy
+
+Most public benchmarks measure what a model knows. This one measures whether it can be trusted in production:
+
+- Does it follow exact format instructions?
+- Does it call tools correctly without adding noise?
+- Does it refuse to fabricate facts?
+- Is it consistent across multiple runs?
+
+A model scoring `9, 9, 2, 8, 1` is worse for agents than one scoring `7, 7, 7, 7, 7`.
+
+---
+
+## Test Suite
+
+16 tests across 6 categories, weighted by production relevance.
+
+### Agent / Tool Reliability — 25%
+
+| Test | What it measures |
+|---|---|
+| `tool_calling` | Returns a single valid function call with no extra text |
+| `multi_step_agent` | Chains 3 tool calls in sequence and produces a final answer |
+
+### Coding / Infrastructure — 25%
+
+| Test | What it measures |
+|---|---|
+| `coding` | Produces a working LIS function with correct time complexity |
+| `yaml_generation` | Returns valid parseable Kubernetes Deployment YAML |
+| `artifact_mermaid` | Returns a valid Mermaid flowchart with all 8 pipeline stages |
+| `json_schema` | Returns a valid JSON Schema with required fields and constraints |
+
+### RAG / Context Fidelity — 20%
+
+| Test | What it measures |
+|---|---|
+| `rag` | Summarises a provided document accurately without invention |
+| `context_begin` | Retrieves a fact from the beginning of a document |
+| `context_middle` | Retrieves a fact from the middle of a document |
+| `context_end` | Retrieves a fact from the end of a document |
+
+### Structured Outputs — 15%
+
+| Test | What it measures |
+|---|---|
+| `structured` | Returns nested JSON with typed fields (recommendations array) |
+| `compression` | Compresses content into exactly 10 bullet points preserving all industries |
+
+### Hallucination Resistance — 10%
+
+| Test | What it measures |
+|---|---|
+| `hallucination` | Refuses to describe a non-existent book — rewards uncertainty, penalises invention |
+
+### Pure Reasoning — 5%
+
+| Test | What it measures |
+|---|---|
+| `reasoning` | Solves a multi-step percentage problem correctly |
+| `math` | Solves a rate problem requiring correct reasoning about independence |
+| `agent` | Plans a search strategy meeting 5 explicit requirements |
+
+---
+
+## Scoring Architecture
+
+```
+Raw output
+    ↓
+normalize_text()     strip ANSI, thinking tokens, Ollama stats
+    ↓
+Layer 1: Deterministic Validator
+    0 or 10  → skip judge (definitive)
+    1–9      → blend with judge (80% validator / 20% judge)
+    ↓
+Layer 2: Semantic Judge (only when needed)
+    qwen2.5:14b with strict rubric — never benchmarked
+    ↓
+Layer 3: Embedding Similarity (RAG test only)
+    nomic-embed-text via Ollama
+    ↓
+format_score (separate)
+    ANSI codes, word limit, markdown obedience
+    ↓
+combined     = semantic × 0.8 + format × 0.2
+weighted_avg = Σ(semantic × test_weight)
+```
+
+---
+
+## What the Numbers Mean
+
+| Metric | Description |
+|---|---|
+| `w` | Weighted semantic average — primary score |
+| `σ` | Standard deviation across tests — lower is more reliable |
+| `fail%` | Percentage of tests scoring ≤ 2/10 — hard failures |
+| `tok/s` | Generation speed on this hardware |
+| `🌡` | Average GPU temperature during benchmark |
+
+**Compliance rates** track pass rate (score ≥ 8) for:
+- JSON — nested structured output
+- YAML — Kubernetes manifest generation
+- Tool — function call format
+- Hallucination — refusal of fabricated content
+
+---
+
+## Requirements
+
+```bash
+pip install pyyaml rapidfuzz requests
+# Ollama running with: judge model, embed model, and models under test
+```
+
+---
+
+## Usage
+
+```bash
+# Run all baseline models
+python3 main.py
+
+# Single model (auto-detects thinking mode)
+python3 main.py --model granite4.1:8b
+
+# Variance analysis — 3 runs per model
+python3 main.py --mode baseline --runs 3
+
+# Auto-discover and test all models in ollama list
+python3 main.py --test-all
+
+# Reports
+python3 main.py --report                  # latest run per model
+python3 main.py --report --report-best    # best run per model
+
+# Fast run (no thermal cooldown)
+python3 main.py --no-cooldown
+```
+
+---
+
+## Configuration
+
+Edit `config.py`:
+
+```python
+MODELS_BASELINE_DIRECT = ["granite4.1:8b", "qwen2.5-coder:14b"]
+MODELS_BASELINE_THINKING = ["nemotron-3-nano:4b", "gemma4:e4b"]
+JUDGE_MODEL = "qwen2.5:14b"  # dedicated — never benchmarked
+EMBED_MODEL = "nomic-embed-text"
+```
+
+---
+
+## File Structure
+
+```
+benchmark_v4/
+  config.py       models, weights, settings
+  prompts.py      all prompts, ground truths, judge rubrics
+  validators.py   Layer 1: deterministic scoring
+  judge.py        Layer 2: LLM judge + embedding similarity
+  scoring.py      combines all layers into final scores
+  runner.py       executes models, orchestrates benchmark
+  storage.py      SQLite read/write (benchmark_v4.db)
+  reporting.py    terminal output
+  main.py         CLI entry point
+```
+
+---
+
+## Results Database
+
+All results stored in `benchmark_v4.db` (SQLite, never deleted).
+
+```sql
+-- Latest ranking
+SELECT model, weighted_avg, stdev_all, failure_rate_pct
+FROM runs
+WHERE id IN (SELECT MAX(id) FROM runs GROUP BY model)
+ORDER BY weighted_avg DESC;
+
+-- Compliance rates
+SELECT model, compliance_json, compliance_yaml,
+       compliance_tool, compliance_hall
+FROM runs
+WHERE id IN (SELECT MAX(id) FROM runs GROUP BY model);
+
+-- Detailed test scores
+SELECT test, semantic_score, format_score, notes
+FROM details
+WHERE model = 'granite4.1:8b'
+AND run_id = (SELECT MAX(id) FROM runs WHERE model = 'granite4.1:8b');
+```
+
+---
+
+## Validated Stack (RTX 5060 Ti 16GB)
+
+| Model | Role | w | σ | fail% |
+|---|---|---|---|---|
+| granite4.1:8b | Reliable default | 6.85 | 0.81 | 0% |
+| qwen2.5-coder:14b | Coding / infra | 6.69 | 1.15 | 0% |
+| nemotron-3-nano:4b | Fast chat | 6.37 | 2.87 | 6% |
+| gemma4:e4b | RAG / research | 6.06 | 2.56 | 6% |
+
+10 models tested. 6 rejected. Rankings stable across rebuilds.
+
+## Output Example by categories
+
+```
+===================================================================
+  CATEGORY BREAKDOWN  (latest run per model)
+====================================================================
+
+  Model                   agent   code    rag struct   hall reason
+  ----------------------------------------------------------------
+  ★ gemma4:e4b              8.5   10.0    9.0   10.0   10.0    7.0
+  ★ granite4.1:8b          10.0   10.0    9.0    7.5   10.0   7.67
+    phi4:latest            10.0   10.0    9.0    6.5   10.0    7.0
+  ★ nemotron-3-nano:4b      7.0    7.5    9.5   10.0   10.0   8.33
+    lfm2:latest            10.0    7.5    9.0   10.0    4.0    8.0
+  ★ qwen2.5-coder:14b      10.0   10.0    8.5    9.5    0.0   6.67
+    mistral-nemo:12b        5.0   10.0    8.5    9.5    6.0    5.0
+```
+
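The category columns in the breakdown above look like per-category averages of the per-test scores: values such as 7.67, 8.33 and 6.67 in the three-test `reason` column are what a plain mean of three scores would produce. The sketch below shows one way such a table could be derived from the `details` table queried in the README. It is an illustration, not the project's `reporting.py`: the plain-mean aggregation, the helper names `category_breakdown` and `latest_rows`, and the assumption that `details` carries `model` and `run_id` columns exactly as used in the README's query are all hypothetical.

```python
import sqlite3
from collections import defaultdict
from statistics import mean

# Test-to-category mapping, copied from the Test Suite tables in the README.
# The short category labels follow the column headers of the breakdown output.
CATEGORY_TESTS = {
    "agent": ["tool_calling", "multi_step_agent"],
    "code": ["coding", "yaml_generation", "artifact_mermaid", "json_schema"],
    "rag": ["rag", "context_begin", "context_middle", "context_end"],
    "struct": ["structured", "compression"],
    "hall": ["hallucination"],
    "reason": ["reasoning", "math", "agent"],
}
TEST_TO_CATEGORY = {
    test: category
    for category, tests in CATEGORY_TESTS.items()
    for test in tests
}


def category_breakdown(rows):
    """Average (test, semantic_score) rows per category.

    Assumption: each category score is the plain mean of its tests' scores.
    Categories with no rows are omitted from the result.
    """
    buckets = defaultdict(list)
    for test, score in rows:
        buckets[TEST_TO_CATEGORY[test]].append(score)
    return {
        category: round(mean(buckets[category]), 2)
        for category in CATEGORY_TESTS
        if category in buckets
    }


def latest_rows(conn, model):
    """Fetch per-test scores for a model's latest run.

    Mirrors the shape of the README's "Detailed test scores" query.
    """
    return conn.execute(
        "SELECT test, semantic_score FROM details "
        "WHERE model = ? "
        "AND run_id = (SELECT MAX(id) FROM runs WHERE model = ?)",
        (model, model),
    ).fetchall()
```

Against a real database this would be called as `category_breakdown(latest_rows(sqlite3.connect("benchmark_v4.db"), "granite4.1:8b"))`. Note that the test named `agent` belongs to the Pure Reasoning category, while the `agent` column refers to the Agent / Tool Reliability category, so the lookup is keyed on test names rather than column labels.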