Add README.md

2026-05-14 14:06:00 +00:00
commit 51e9389726
1 changed files with 237 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,237 @@
+# LLM Benchmark V4
+
+A modular, SQLite-backed benchmark for evaluating local LLMs running on Ollama.
+Designed for **operational reliability in agentic and automated pipelines** — not general intelligence.
+It rewards format obedience, structured output correctness, tool call precision, and hallucination resistance.
+It intentionally penalises verbosity and creative deviation.
+
+---
+
+## Philosophy
+
+Most public benchmarks measure what a model knows. This one measures whether it can be trusted in production:
+
+- Does it follow exact format instructions?
+- Does it call tools correctly without adding noise?
+- Does it refuse to fabricate facts?
+- Is it consistent across multiple runs?
+
+A model scoring `9, 9, 2, 8, 1` is worse for agents than one scoring `7, 7, 7, 7, 7`: an automated pipeline breaks at its weakest step, so consistency matters more than peak scores.
+
+---
+
+## Test Suite
+
+16 tests across 6 categories, weighted by production relevance.
+
+### Agent / Tool Reliability — 25%
+
+| Test | What it measures |
+|---|---|
+| `tool_calling` | Returns a single valid function call with no extra text |
+| `multi_step_agent` | Chains 3 tool calls in sequence and produces a final answer |
+
+### Coding / Infrastructure — 25%
+
+| Test | What it measures |
+|---|---|
+| `coding` | Produces a working longest increasing subsequence (LIS) function with correct time complexity |
+| `yaml_generation` | Returns valid parseable Kubernetes Deployment YAML |
+| `artifact_mermaid` | Returns a valid Mermaid flowchart with all 8 pipeline stages |
+| `json_schema` | Returns a valid JSON Schema with required fields and constraints |
+
+### RAG / Context Fidelity — 20%
+
+| Test | What it measures |
+|---|---|
+| `rag` | Summarises a provided document accurately without invention |
+| `context_begin` | Retrieves a fact from the beginning of a document |
+| `context_middle` | Retrieves a fact from the middle of a document |
+| `context_end` | Retrieves a fact from the end of a document |
+
+### Structured Outputs — 15%
+
+| Test | What it measures |
+|---|---|
+| `structured` | Returns nested JSON with typed fields (recommendations array) |
+| `compression` | Compresses content into exactly 10 bullet points preserving all industries |
+
+### Hallucination Resistance — 10%
+
+| Test | What it measures |
+|---|---|
+| `hallucination` | Refuses to describe a non-existent book — rewards uncertainty, penalises invention |
+
+### Pure Reasoning — 5%
+
+| Test | What it measures |
+|---|---|
+| `reasoning` | Solves a multi-step percentage problem correctly |
+| `math` | Solves a rate problem requiring correct reasoning about independence |
+| `agent` | Plans a search strategy meeting 5 explicit requirements |
+
+---
+
+## Scoring Architecture
+
+```
+Raw output
+  ↓
+normalize_text()        strip ANSI, thinking tokens, Ollama stats
+  ↓
+Layer 1: Deterministic Validator
+  0 or 10 → skip judge (definitive)
+  1–9     → blend with judge (80% validator / 20% judge)
+  ↓
+Layer 2: Semantic Judge (only when needed)
+  qwen2.5:14b with strict rubric — never benchmarked
+  ↓
+Layer 3: Embedding Similarity (RAG test only)
+  nomic-embed-text via Ollama
+  ↓
+format_score (separate)
+  ANSI codes, word limit, markdown obedience
+  ↓
+combined = semantic × 0.8 + format × 0.2
+weighted_avg = Σ(semantic × test_weight)
+```
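+
+The blending rules in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not the code in `scoring.py`: the function names `semantic_score` and `combined_score` are hypothetical, and the sketch leaves out the Layer 3 embedding similarity because the diagram does not say how it enters the blend.
+
+```python
+from typing import Optional
+
+
+def semantic_score(validator: float, judge: Optional[float] = None) -> float:
+    """Blend the Layer 1 validator score with the Layer 2 judge score."""
+    if validator in (0, 10):
+        # A validator score of 0 or 10 is definitive, so the judge is skipped.
+        return float(validator)
+    if judge is None:
+        raise ValueError("validator scores 1-9 need a judge score")
+    # 80% validator / 20% judge, as stated in the diagram.
+    return 0.8 * validator + 0.2 * judge
+
+
+def combined_score(semantic: float, format_score: float) -> float:
+    """combined = semantic x 0.8 + format x 0.2"""
+    return 0.8 * semantic + 0.2 * format_score
+```
+
+Under these assumptions, a validator score of 5 with a judge score of 10 blends to a semantic score of about 6.0, and a perfect format score then lifts the combined score to about 6.8.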
+
+---
+
+## What the Numbers Mean
+
+| Metric | Description |
+|---|---|
+| `w` | Weighted semantic average — primary score |
+| `σ` | Standard deviation across tests — lower is more reliable |
+| `fail%` | Percentage of tests scoring ≤ 2/10 — hard failures |
+| `tok/s` | Generation speed on this hardware |
+| `🌡` | Average GPU temperature during benchmark |
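+
+As a rough sketch of how `w`, `σ`, and `fail%` could be derived from one run's per-test semantic scores: the category weights and test names come from the Test Suite section, but two things are assumptions here, not confirmed behaviour. Each category's weight is split evenly across its tests, and `σ` is computed as a population standard deviation. The helper names are hypothetical.
+
+```python
+from statistics import pstdev
+
+# Category weights and test names as listed in the Test Suite section.
+CATEGORIES = {
+    "agent_tool": (0.25, ["tool_calling", "multi_step_agent"]),
+    "coding_infra": (0.25, ["coding", "yaml_generation",
+                            "artifact_mermaid", "json_schema"]),
+    "rag_context": (0.20, ["rag", "context_begin",
+                           "context_middle", "context_end"]),
+    "structured": (0.15, ["structured", "compression"]),
+    "hallucination": (0.10, ["hallucination"]),
+    "reasoning": (0.05, ["reasoning", "math", "agent"]),
+}
+
+
+def per_test_weights(categories=CATEGORIES):
+    # ASSUMPTION: a category's weight is shared equally by its tests.
+    return {
+        test: weight / len(tests)
+        for weight, tests in categories.values()
+        for test in tests
+    }
+
+
+def summarize(scores):
+    """scores maps each test name to its semantic score (0-10)."""
+    weights = per_test_weights()
+    values = [scores[test] for test in weights]
+    return {
+        # weighted_avg = sum(semantic x test_weight)
+        "w": sum(scores[test] * wt for test, wt in weights.items()),
+        # ASSUMPTION: population standard deviation across tests.
+        "sigma": pstdev(values),
+        # Hard failures are tests scoring 2/10 or less.
+        "fail_pct": 100 * sum(v <= 2 for v in values) / len(values),
+    }
+```
+
+With 16 tests, a single hard failure gives a `fail_pct` of 6.25.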
+
+**Compliance rates** track pass rate (score ≥ 8) for:
+- JSON — nested structured output
+- YAML — Kubernetes manifest generation
+- Tool — function call format
+- Hallucination — refusal of fabricated content
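+
+A pass rate of this kind might be computed as follows. The mapping from compliance label to test name is inferred from the descriptions above, and aggregating over several runs is an assumption; `COMPLIANCE_TESTS` and `compliance_rates` are hypothetical names.
+
+```python
+# ASSUMPTION: which test feeds each compliance label (inferred from the list above).
+COMPLIANCE_TESTS = {
+    "json": "structured",
+    "yaml": "yaml_generation",
+    "tool": "tool_calling",
+    "hall": "hallucination",
+}
+
+PASS_THRESHOLD = 8  # a test passes when its score is >= 8
+
+
+def compliance_rates(runs):
+    """runs is a list of {test_name: score} dicts, one per benchmark run."""
+    rates = {}
+    for label, test in COMPLIANCE_TESTS.items():
+        scores = [run[test] for run in runs if test in run]
+        if not scores:
+            rates[label] = None  # no data for this test
+            continue
+        passed = sum(score >= PASS_THRESHOLD for score in scores)
+        rates[label] = 100 * passed / len(scores)
+    return rates
+```
+
+A model that scores 10 and 6 on `yaml_generation` over two runs would get a YAML compliance rate of 50 under this sketch.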
+
+---
+
+## Requirements
+
+```bash
+pip install pyyaml rapidfuzz requests
+# Ollama running with: judge model, embed model, and models under test
+```
+
+---
+
+## Usage
+
+```bash
+# Run all baseline models
+python3 main.py
+
+# Single model (auto-detects thinking mode)
+python3 main.py --model granite4.1:8b
+
+# Variance analysis — 3 runs per model
+python3 main.py --mode baseline --runs 3
+
+# Auto-discover and test all models in ollama list
+python3 main.py --test-all
+
+# Reports
+python3 main.py --report               # latest run per model
+python3 main.py --report --report-best # best run per model
+
+# Fast run (no thermal cooldown)
+python3 main.py --no-cooldown
+```
+
+---
+
+## Configuration
+
+Edit `config.py`:
+
+```python
+MODELS_BASELINE_DIRECT   = ["granite4.1:8b", "qwen2.5-coder:14b"]
+MODELS_BASELINE_THINKING = ["nemotron-3-nano:4b", "gemma4:e4b"]
+JUDGE_MODEL              = "qwen2.5:14b"   # dedicated — never benchmarked
+EMBED_MODEL              = "nomic-embed-text"
+```
+
+---
+
+## File Structure
+
+```
+benchmark_v4/
+  config.py      models, weights, settings
+  prompts.py     all prompts, ground truths, judge rubrics
+  validators.py  Layer 1: deterministic scoring
+  judge.py       Layer 2: LLM judge + embedding similarity
+  scoring.py     combines all layers into final scores
+  runner.py      executes models, orchestrates benchmark
+  storage.py     SQLite read/write (benchmark_v4.db)
+  reporting.py   terminal output
+  main.py        CLI entry point
+```
+
+---
+
+## Results Database
+
+All results are stored in `benchmark_v4.db` (SQLite) and are never deleted.
+
+```sql
+-- Latest ranking
+SELECT model, weighted_avg, stdev_all, failure_rate_pct
+FROM runs
+WHERE id IN (SELECT MAX(id) FROM runs GROUP BY model)
+ORDER BY weighted_avg DESC;
+
+-- Compliance rates
+SELECT model, compliance_json, compliance_yaml,
+       compliance_tool, compliance_hall
+FROM runs
+WHERE id IN (SELECT MAX(id) FROM runs GROUP BY model);
+
+-- Detailed test scores
+SELECT test, semantic_score, format_score, notes
+FROM details
+WHERE model = 'granite4.1:8b'
+AND run_id = (SELECT MAX(id) FROM runs WHERE model = 'granite4.1:8b');
+```
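+
+The same queries can be run from Python with the standard-library `sqlite3` module. The sketch below wraps the latest-ranking query; `latest_ranking` is a hypothetical helper, not part of `storage.py`, and it assumes the `runs` table has the columns used in the SQL above.
+
+```python
+import sqlite3
+
+LATEST_RANKING = """
+SELECT model, weighted_avg, stdev_all, failure_rate_pct
+FROM runs
+WHERE id IN (SELECT MAX(id) FROM runs GROUP BY model)
+ORDER BY weighted_avg DESC
+"""
+
+
+def latest_ranking(con):
+    """Return (model, w, sigma, fail%) rows for each model's most recent run."""
+    return con.execute(LATEST_RANKING).fetchall()
+```
+
+To use it, open the database with `sqlite3.connect("benchmark_v4.db")`, pass the connection to `latest_ranking`, and close the connection afterwards.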
+
+---
+
+## Validated Stack (RTX 5060 Ti 16GB)
+
+| Model | Role | w | σ | fail% |
+|---|---|---|---|---|
+| granite4.1:8b | Reliable default | 6.85 | 0.81 | 0% |
+| qwen2.5-coder:14b | Coding / infra | 6.69 | 1.15 | 0% |
+| nemotron-3-nano:4b | Fast chat | 6.37 | 2.87 | 6% |
+| gemma4:e4b | RAG / research | 6.06 | 2.56 | 6% |
+
+10 models tested. 6 rejected. Rankings stable across rebuilds.
+
+---
+
+## Example Output by Category
+
+```
+========================================================================
+   CATEGORY BREAKDOWN (latest run per model)
+========================================================================
+
+  Model                    agent    code     rag  struct    hall  reason
+  ----------------------------------------------------------------------
+  ★ gemma4:e4b               8.5    10.0     9.0    10.0    10.0     7.0
+  ★ granite4.1:8b           10.0    10.0     9.0     7.5    10.0    7.67
+    phi4:latest             10.0    10.0     9.0     6.5    10.0     7.0
+  ★ nemotron-3-nano:4b       7.0     7.5     9.5    10.0    10.0    8.33
+    lfm2:latest             10.0     7.5     9.0    10.0     4.0     8.0
+  ★ qwen2.5-coder:14b       10.0    10.0     8.5     9.5     0.0    6.67
+    mistral-nemo:12b         5.0    10.0     8.5     9.5     6.0     5.0
+```
+