llm-benchmark/README.md
2026-05-14 14:06:00 +00:00

# LLM Benchmark V4
A modular, SQLite-backed benchmark for evaluating local LLMs running on Ollama.
Designed for **operational reliability in agentic and automated pipelines** — not general intelligence.
It rewards format obedience, structured output correctness, tool call precision, and hallucination resistance.
It intentionally penalises verbosity and creative deviation.
---
## Philosophy
Most public benchmarks measure what a model knows. This one measures whether it can be trusted in production:
- Does it follow exact format instructions?
- Does it call tools correctly without adding noise?
- Does it refuse to fabricate facts?
- Is it consistent across multiple runs?
A model scoring `9, 9, 2, 8, 1` is worse for agents than one scoring `7, 7, 7, 7, 7`.
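
The comparison above can be checked with Python's `statistics` module. This is only a sketch: the helper name `summarize` is hypothetical, the use of the population standard deviation is an assumption, and the hard-failure threshold of 2/10 is borrowed from the `fail%` definition later in this README.

```python
from statistics import mean, pstdev

erratic = [9, 9, 2, 8, 1]
steady = [7, 7, 7, 7, 7]

def summarize(scores, fail_threshold=2):
    """Return (mean, population standard deviation, hard-failure count)."""
    hard_failures = sum(1 for s in scores if s <= fail_threshold)
    return mean(scores), pstdev(scores), hard_failures

for name, scores in (("erratic", erratic), ("steady", steady)):
    avg, spread, fails = summarize(scores)
    print(f"{name}: mean={avg:.2f} stdev={spread:.2f} hard_failures={fails}")
```

The erratic model has the lower mean (5.8 against 7), a much larger spread, and two hard failures. In an automated pipeline each of those failures is a broken step, which is why this benchmark reports spread and failure rate alongside the average.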
---
## Test Suite
16 tests across 6 categories, weighted by production relevance.
### Agent / Tool Reliability — 25%
| Test | What it measures |
|---|---|
| `tool_calling` | Returns a single valid function call with no extra text |
| `multi_step_agent` | Chains 3 tool calls in sequence and produces a final answer |
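
As an illustration of what a deterministic check for `tool_calling` could look like, here is a minimal sketch. It assumes the expected output is a bare JSON object with `name` and `arguments` keys; that shape, the function name `validate_tool_call`, and the tool name `get_weather` are all hypothetical. The real prompt and validator live in `prompts.py` and `validators.py` and may award partial scores.

```python
import json

def validate_tool_call(output: str, expected_name: str) -> int:
    """Score 10 for a lone, well-formed call to expected_name, else 0."""
    try:
        # json.loads rejects any prose before or after the JSON value.
        call = json.loads(output.strip())
    except json.JSONDecodeError:
        return 0
    if not isinstance(call, dict):
        return 0  # e.g. a list of several calls instead of a single one
    if call.get("name") != expected_name:
        return 0
    if not isinstance(call.get("arguments"), dict):
        return 0
    return 10

good = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
noisy = 'Sure! Here is the call: ' + good
print(validate_tool_call(good, "get_weather"))
print(validate_tool_call(noisy, "get_weather"))
```

Parsing the whole output as one JSON value is what enforces the "no extra text" requirement: any greeting or explanation around the call makes parsing fail.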
### Coding / Infrastructure — 25%
| Test | What it measures |
|---|---|
| `coding` | Produces a working longest increasing subsequence (LIS) function with correct time complexity |
| `yaml_generation` | Returns valid parseable Kubernetes Deployment YAML |
| `artifact_mermaid` | Returns a valid Mermaid flowchart with all 8 pipeline stages |
| `json_schema` | Returns a valid JSON Schema with required fields and constraints |
### RAG / Context Fidelity — 20%
| Test | What it measures |
|---|---|
| `rag` | Summarises a provided document accurately without invention |
| `context_begin` | Retrieves a fact from the beginning of a document |
| `context_middle` | Retrieves a fact from the middle of a document |
| `context_end` | Retrieves a fact from the end of a document |
### Structured Outputs — 15%
| Test | What it measures |
|---|---|
| `structured` | Returns nested JSON with typed fields (recommendations array) |
| `compression` | Compresses content into exactly 10 bullet points preserving all industries |
### Hallucination Resistance — 10%
| Test | What it measures |
|---|---|
| `hallucination` | Refuses to describe a non-existent book — rewards uncertainty, penalises invention |
### Pure Reasoning — 5%
| Test | What it measures |
|---|---|
| `reasoning` | Solves a multi-step percentage problem correctly |
| `math` | Solves a rate problem requiring correct reasoning about independence |
| `agent` | Plans a search strategy meeting 5 explicit requirements |
---
## Scoring Architecture
```
Raw output
normalize_text()      strip ANSI, thinking tokens, Ollama stats
Layer 1: Deterministic Validator
  0 or 10  → skip judge (definitive)
  1-9      → blend with judge (80% validator / 20% judge)
Layer 2: Semantic Judge (only when needed)
  qwen2.5:14b with strict rubric, never benchmarked
Layer 3: Embedding Similarity (RAG test only)
  nomic-embed-text via Ollama
format_score (separate)
  ANSI codes, word limit, markdown obedience
combined     = semantic × 0.8 + format × 0.2
weighted_avg = Σ(semantic × test_weight)
```
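
The blending and weighting rules in the diagram can be sketched in a few lines of Python. This is an illustration, not the code in `scoring.py`: the function names are hypothetical, the embedding layer is omitted, and splitting each category weight evenly across its tests is an assumption (the README states category weights only). Category keys follow the short names used in the report output.

```python
CATEGORY_WEIGHTS = {"agent": 0.25, "code": 0.25, "rag": 0.20,
                    "struct": 0.15, "hall": 0.10, "reason": 0.05}
CATEGORY_TESTS = {
    "agent": ["tool_calling", "multi_step_agent"],
    "code": ["coding", "yaml_generation", "artifact_mermaid", "json_schema"],
    "rag": ["rag", "context_begin", "context_middle", "context_end"],
    "struct": ["structured", "compression"],
    "hall": ["hallucination"],
    "reason": ["reasoning", "math", "agent"],
}
# Assumption: each category weight is shared equally by its tests.
TEST_WEIGHT = {
    test: CATEGORY_WEIGHTS[cat] / len(tests)
    for cat, tests in CATEGORY_TESTS.items()
    for test in tests
}

def semantic_score(validator, judge_fn):
    """Layers 1 and 2: a validator score of 0 or 10 is final, so the judge is skipped."""
    if validator in (0, 10):
        return float(validator)
    # Partial credit: 80% validator, 20% judge. The judge is only called here.
    return 0.8 * validator + 0.2 * judge_fn()

def combined_score(semantic, fmt):
    """Per-test combined score: semantic x 0.8 + format x 0.2."""
    return 0.8 * semantic + 0.2 * fmt

def weighted_avg(semantic_by_test, test_weight=TEST_WEIGHT):
    """Sum of semantic score times test weight over all tests."""
    return sum(score * test_weight[test] for test, score in semantic_by_test.items())

print(semantic_score(5, lambda: 10))   # validator 5, judge 10
print(combined_score(10, 5))           # perfect semantics, weak formatting
```

Passing the judge as a callable keeps the "only when needed" behaviour explicit: a definitive validator result never triggers a judge call.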
---
## What the Numbers Mean
| Metric | Description |
|---|---|
| `w` | Weighted semantic average — primary score |
| `σ` | Standard deviation across tests — lower is more reliable |
| `fail%` | Percentage of tests scoring ≤ 2/10 — hard failures |
| `tok/s` | Generation speed on this hardware |
| `🌡` | Average GPU temperature during benchmark |
**Compliance rates** track pass rate (score ≥ 8) for:
- JSON — nested structured output
- YAML — Kubernetes manifest generation
- Tool — function call format
- Hallucination — refusal of fabricated content
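
The two threshold-based metrics reduce to simple counting. The sketch below uses hypothetical helper names, and it assumes that compliance is computed over repeated runs of one test; the README does not say how the pass rate is aggregated.

```python
def fail_pct(test_scores):
    """Percentage of tests in one run scoring 2/10 or lower (hard failures)."""
    return 100 * sum(s <= 2 for s in test_scores) / len(test_scores)

def compliance_rate(scores_across_runs):
    """Percentage of runs in which a test scored 8/10 or higher."""
    return 100 * sum(s >= 8 for s in scores_across_runs) / len(scores_across_runs)

# Placeholder scores: 16 tests with a single hard failure.
one_run = [7] * 15 + [2]
print(fail_pct(one_run))
print(compliance_rate([10, 9, 4]))
```

With 16 tests, a single hard failure gives 6.25%, which would match a reported `fail%` of 6% if values are rounded.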
---
## Requirements
```bash
pip install pyyaml rapidfuzz requests
# Ollama running with: judge model, embed model, and models under test
```
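
Before a long run it can help to confirm that the judge, embedding, and test models are actually installed. The sketch below is not part of the benchmark: the helper names are hypothetical, and it assumes Ollama's default address `http://localhost:11434` and its `/api/tags` endpoint, which lists local models under a `models` key.

```python
import json
import urllib.request

def missing_models(tags_payload, required):
    """Return the required model names that are absent from an /api/tags response."""
    installed = {m["name"] for m in tags_payload.get("models", [])}
    # Models pulled without a tag are listed as "<name>:latest".
    return [r for r in required
            if r not in installed and f"{r}:latest" not in installed]

def fetch_tags(base_url="http://localhost:11434"):
    """Query a running Ollama server for its installed models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)

# Offline example with a hand-written payload in the same shape.
payload = {"models": [{"name": "qwen2.5:14b"}, {"name": "nomic-embed-text:latest"}]}
print(missing_models(payload, ["qwen2.5:14b", "nomic-embed-text", "granite4.1:8b"]))
```

Against a live server, `missing_models(fetch_tags(), required)` performs the same check.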
---
## Usage
```bash
# Run all baseline models
python3 main.py
# Single model (auto-detects thinking mode)
python3 main.py --model granite4.1:8b
# Variance analysis — 3 runs per model
python3 main.py --mode baseline --runs 3
# Auto-discover and test all models in ollama list
python3 main.py --test-all
# Reports
python3 main.py --report # latest run per model
python3 main.py --report --report-best # best run per model
# Fast run (no thermal cooldown)
python3 main.py --no-cooldown
```
---
## Configuration
Edit `config.py`:
```python
MODELS_BASELINE_DIRECT = ["granite4.1:8b", "qwen2.5-coder:14b"]
MODELS_BASELINE_THINKING = ["nemotron-3-nano:4b", "gemma4:e4b"]
JUDGE_MODEL = "qwen2.5:14b" # dedicated — never benchmarked
EMBED_MODEL = "nomic-embed-text"
```
---
## File Structure
```
benchmark_v4/
  config.py       models, weights, settings
  prompts.py      all prompts, ground truths, judge rubrics
  validators.py   Layer 1: deterministic scoring
  judge.py        Layer 2: LLM judge + embedding similarity
  scoring.py      combines all layers into final scores
  runner.py       executes models, orchestrates benchmark
  storage.py      SQLite read/write (benchmark_v4.db)
  reporting.py    terminal output
  main.py         CLI entry point
```
---
## Results Database
All results are stored in `benchmark_v4.db` (SQLite, never deleted).
```sql
-- Latest ranking
SELECT model, weighted_avg, stdev_all, failure_rate_pct
FROM runs
WHERE id IN (SELECT MAX(id) FROM runs GROUP BY model)
ORDER BY weighted_avg DESC;

-- Compliance rates
SELECT model, compliance_json, compliance_yaml,
       compliance_tool, compliance_hall
FROM runs
WHERE id IN (SELECT MAX(id) FROM runs GROUP BY model);

-- Detailed test scores
SELECT test, semantic_score, format_score, notes
FROM details
WHERE model = 'granite4.1:8b'
  AND run_id = (SELECT MAX(id) FROM runs WHERE model = 'granite4.1:8b');
```
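
The same queries can be run from Python with the standard `sqlite3` module. The sketch below runs the latest-ranking query against an in-memory database so that it is self-contained. The table definition is a guess that covers only the columns named in the query, and the rows are placeholders, not benchmark results.

```python
import sqlite3

LATEST_RANKING = """
SELECT model, weighted_avg, stdev_all, failure_rate_pct
FROM runs
WHERE id IN (SELECT MAX(id) FROM runs GROUP BY model)
ORDER BY weighted_avg DESC;
"""

def latest_ranking(conn):
    """Return one row per model, taken from that model's most recent run."""
    return conn.execute(LATEST_RANKING).fetchall()

# Assumed minimal schema and placeholder rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE runs (id INTEGER PRIMARY KEY, model TEXT, "
    "weighted_avg REAL, stdev_all REAL, failure_rate_pct REAL)"
)
conn.executemany(
    "INSERT INTO runs (model, weighted_avg, stdev_all, failure_rate_pct) "
    "VALUES (?, ?, ?, ?)",
    [
        ("model-a", 5.0, 1.0, 0.0),
        ("model-b", 6.0, 2.0, 6.25),
        ("model-a", 7.0, 0.5, 0.0),  # newer run of model-a supersedes the first
    ],
)
for row in latest_ranking(conn):
    print(row)
```

To query real results, replace `":memory:"` with the path to `benchmark_v4.db` and drop the `CREATE TABLE` and `INSERT` statements.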
---
## Validated Stack (RTX 5060 Ti 16GB)
| Model | Role | w | σ | fail% |
|---|---|---|---|---|
| granite4.1:8b | Reliable default | 6.85 | 0.81 | 0% |
| qwen2.5-coder:14b | Coding / infra | 6.69 | 1.15 | 0% |
| nemotron-3-nano:4b | Fast chat | 6.37 | 2.87 | 6% |
| gemma4:e4b | RAG / research | 6.06 | 2.56 | 6% |
10 models tested. 6 rejected. Rankings stable across rebuilds.
## Example Output by Category
```
================================================================
  CATEGORY BREAKDOWN (latest run per model)
================================================================
  Model                 agent   code    rag struct   hall reason
----------------------------------------------------------------
★ gemma4:e4b              8.5   10.0    9.0   10.0   10.0    7.0
★ granite4.1:8b          10.0   10.0    9.0    7.5   10.0   7.67
  phi4:latest            10.0   10.0    9.0    6.5   10.0    7.0
★ nemotron-3-nano:4b      7.0    7.5    9.5   10.0   10.0   8.33
  lfm2:latest            10.0    7.5    9.0   10.0    4.0    8.0
★ qwen2.5-coder:14b      10.0   10.0    8.5    9.5    0.0   6.67
  mistral-nemo:12b        5.0   10.0    8.5    9.5    6.0    5.0
```