Add README.md
# LLM Benchmark V4

A modular, SQLite-backed benchmark for evaluating local LLMs running on Ollama.
It is designed for **operational reliability in agentic and automated pipelines**, not general intelligence.
It rewards format obedience, structured output correctness, tool call precision, and hallucination resistance.
It intentionally penalises verbosity and creative deviation.

---

## Philosophy

Most public benchmarks measure what a model knows. This one measures whether it can be trusted in production:

- Does it follow exact format instructions?
- Does it call tools correctly without adding noise?
- Does it refuse to fabricate facts?
- Is it consistent across multiple runs?

Consistency matters more than peak scores: a model scoring `9, 9, 2, 8, 1` is worse for agents than one scoring `7, 7, 7, 7, 7`, because a single hard failure can break an automated pipeline.

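The consistency argument can be made concrete with the two score lists above. This is a minimal sketch using only Python's standard library. The ≤ 2 failure threshold is the one this README uses for `fail%`; the helper name `summarize` is hypothetical, and using the population standard deviation is an assumption, since the README does not say which variant the benchmark computes. Under those assumptions the first list comes out at a mean of 5.80 with 40% hard failures, against 7.00 and 0% for the second.

```python
from statistics import mean, pstdev

FAIL_THRESHOLD = 2  # README: a test scoring <= 2/10 counts as a hard failure


def summarize(scores):
    """Return (mean, population stdev, hard-failure percentage) for 0-10 scores."""
    failures = sum(1 for s in scores if s <= FAIL_THRESHOLD)
    return mean(scores), pstdev(scores), 100.0 * failures / len(scores)


spiky = [9, 9, 2, 8, 1]
steady = [7, 7, 7, 7, 7]

for name, scores in (("spiky", spiky), ("steady", steady)):
    avg, sigma, fail_pct = summarize(scores)
    print(f"{name}: mean={avg:.2f} sigma={sigma:.2f} fail%={fail_pct:.0f}")
```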

---

## Test Suite

16 tests across 6 categories, weighted by production relevance.

### Agent / Tool Reliability — 25%

| Test | What it measures |
|---|---|
| `tool_calling` | Returns a single valid function call with no extra text |
| `multi_step_agent` | Chains 3 tool calls in sequence and produces a final answer |

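As an illustration of what a deterministic check for `tool_calling` could look like, the sketch below accepts an output only if the whole response is a single JSON object describing one call. The JSON shape (`name` and `arguments` keys), the all-or-nothing scoring, and the function name `validate_tool_call` are assumptions made for illustration; the benchmark's actual tool-call format and partial-credit rules are defined in `validators.py` and are not shown in this README.

```python
import json


def validate_tool_call(output: str) -> int:
    """Score 10 if `output` is exactly one JSON tool call, else 0 (hypothetical rule)."""
    try:
        call = json.loads(output.strip())
    except json.JSONDecodeError:
        # Any prose before or after the JSON makes the whole string unparseable.
        return 0
    if not isinstance(call, dict):
        return 0
    # Assumed shape: {"name": <str>, "arguments": <object>}
    if isinstance(call.get("name"), str) and isinstance(call.get("arguments"), dict):
        return 10
    return 0


# A clean call passes; the same call wrapped in chatty text fails.
print(validate_tool_call('{"name": "get_weather", "arguments": {"city": "Oslo"}}'))
print(validate_tool_call('Sure! {"name": "get_weather", "arguments": {}}'))
```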

### Coding / Infrastructure — 25%

| Test | What it measures |
|---|---|
| `coding` | Produces a working longest increasing subsequence (LIS) function with correct time complexity |
| `yaml_generation` | Returns valid, parseable Kubernetes Deployment YAML |
| `artifact_mermaid` | Returns a valid Mermaid flowchart with all 8 pipeline stages |
| `json_schema` | Returns a valid JSON Schema with required fields and constraints |

### RAG / Context Fidelity — 20%

| Test | What it measures |
|---|---|
| `rag` | Summarises a provided document accurately without invention |
| `context_begin` | Retrieves a fact from the beginning of a document |
| `context_middle` | Retrieves a fact from the middle of a document |
| `context_end` | Retrieves a fact from the end of a document |

### Structured Outputs — 15%

| Test | What it measures |
|---|---|
| `structured` | Returns nested JSON with typed fields (recommendations array) |
| `compression` | Compresses content into exactly 10 bullet points preserving all industries |

### Hallucination Resistance — 10%

| Test | What it measures |
|---|---|
| `hallucination` | Refuses to describe a non-existent book — rewards uncertainty, penalises invention |

### Pure Reasoning — 5%

| Test | What it measures |
|---|---|
| `reasoning` | Solves a multi-step percentage problem correctly |
| `math` | Solves a rate problem requiring correct reasoning about independence |
| `agent` | Plans a search strategy meeting 5 explicit requirements |

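The category weights and test names in the tables above can be turned into per-test weights roughly as follows. This is a sketch, not the contents of `config.py`: the dictionary names and category keys are hypothetical, and splitting each category's weight equally among its tests is an assumption, because this README only states the category totals.

```python
# Category weights and test names are taken from the tables in this README.
CATEGORY_WEIGHTS = {
    "agent": 0.25,
    "coding": 0.25,
    "rag": 0.20,
    "structured": 0.15,
    "hallucination": 0.10,
    "reasoning": 0.05,
}

TESTS_BY_CATEGORY = {
    "agent": ["tool_calling", "multi_step_agent"],
    "coding": ["coding", "yaml_generation", "artifact_mermaid", "json_schema"],
    "rag": ["rag", "context_begin", "context_middle", "context_end"],
    "structured": ["structured", "compression"],
    "hallucination": ["hallucination"],
    "reasoning": ["reasoning", "math", "agent"],
}

# Assumption: each category's weight is shared equally by its tests.
TEST_WEIGHTS = {
    test: CATEGORY_WEIGHTS[category] / len(tests)
    for category, tests in TESTS_BY_CATEGORY.items()
    for test in tests
}


def weighted_avg(semantic_scores):
    """Sum of semantic score x test weight over all tests."""
    return sum(semantic_scores[test] * weight for test, weight in TEST_WEIGHTS.items())


assert len(TEST_WEIGHTS) == 16                       # 16 tests, as stated above
assert abs(sum(TEST_WEIGHTS.values()) - 1.0) < 1e-9  # weights cover 100%
print(round(weighted_avg({test: 7.0 for test in TEST_WEIGHTS}), 2))
```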

---

## Scoring Architecture

```
Raw output
    ↓
normalize_text()    strip ANSI, thinking tokens, Ollama stats
    ↓
Layer 1: Deterministic Validator
    0 or 10 → skip judge (definitive)
    1–9     → blend with judge (80% validator / 20% judge)
    ↓
Layer 2: Semantic Judge (only when needed)
    qwen2.5:14b with strict rubric — never benchmarked
    ↓
Layer 3: Embedding Similarity (RAG test only)
    nomic-embed-text via Ollama
    ↓
format_score (separate)
    ANSI codes, word limit, markdown obedience
    ↓
combined     = semantic × 0.8 + format × 0.2
weighted_avg = Σ(semantic × test_weight)
```

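The arithmetic in the diagram can be sketched in a few lines. The function names here are hypothetical and this is not the code in `scoring.py`; it only mirrors the two stated rules (a validator score of 0 or 10 is final, anything in between is blended 80/20 with the judge) and the `combined` formula. How the Layer 3 embedding similarity enters the RAG score is not described in this README, so it is left out.

```python
def semantic_score(validator, judge=None):
    """Blend Layer 1 and Layer 2 following the rules in the diagram."""
    if validator in (0, 10):
        return float(validator)  # definitive: the judge is skipped
    if judge is None:
        raise ValueError("a judge score is required when the validator returns 1-9")
    return 0.8 * validator + 0.2 * judge


def combined_score(semantic, format_score):
    """Per-test combined score: semantic x 0.8 + format x 0.2."""
    return 0.8 * semantic + 0.2 * format_score


print(f"{semantic_score(10):.2f}")          # validator alone decides
print(f"{semantic_score(6, judge=9):.2f}")  # 0.8 * 6 + 0.2 * 9
print(f"{combined_score(6.6, 10):.2f}")     # 0.8 * 6.6 + 0.2 * 10
```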

---

## What the Numbers Mean

| Metric | Description |
|---|---|
| `w` | Weighted semantic average — primary score |
| `σ` | Standard deviation across tests — lower is more reliable |
| `fail%` | Percentage of tests scoring ≤ 2/10 — hard failures |
| `tok/s` | Generation speed on this hardware |
| `🌡` | Average GPU temperature during benchmark |

**Compliance rates** track the pass rate (score ≥ 8) for:

- JSON — nested structured output
- YAML — Kubernetes manifest generation
- Tool — function call format
- Hallucination — refusal of fabricated content

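A pass rate with the ≥ 8 threshold can be computed as in the sketch below. The function name and the sample scores are hypothetical, and the README does not specify which set of attempts each compliance rate is aggregated over, so the input is simply a list of scores for one of the four tracked areas.

```python
PASS_THRESHOLD = 8  # README: an attempt passes when its score is >= 8


def compliance_rate(scores):
    """Percentage of scores that meet the pass threshold (0.0 for no data)."""
    if not scores:
        return 0.0
    passed = sum(1 for s in scores if s >= PASS_THRESHOLD)
    return 100.0 * passed / len(scores)


# Hypothetical scores for one tracked area across three attempts.
print(f"{compliance_rate([10, 8, 6]):.1f}%")
```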

---

## Requirements

```bash
pip install pyyaml rapidfuzz requests
# Ollama must be running with the judge model, the embed model, and the models under test available
```

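Before a long run it can help to confirm that the required models are actually installed. The sketch below queries Ollama's `GET /api/tags` endpoint using only the standard library and compares the result with a list of required names. The helper names are hypothetical, `http://localhost:11434` is Ollama's default address and may differ on your machine, and treating a bare name as `:latest` is an assumption about how names are matched. The demo at the end runs offline against a hand-written set of installed models; call `installed_models()` yourself once Ollama is up.

```python
import json
from urllib.request import urlopen

OLLAMA_URL = "http://localhost:11434"  # default Ollama address; adjust if needed


def installed_models(base_url=OLLAMA_URL):
    """Return the model names reported by Ollama's GET /api/tags endpoint."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        payload = json.load(resp)
    return {m["name"] for m in payload.get("models", [])}


def missing_models(required, installed):
    """Return required models that are not installed (bare names mean ':latest')."""
    def normalize(name):
        return name if ":" in name else f"{name}:latest"

    installed_norm = {normalize(n) for n in installed}
    return sorted(r for r in required if normalize(r) not in installed_norm)


# Offline demo with a hand-written "installed" set (not real output).
required = ["qwen2.5:14b", "nomic-embed-text", "granite4.1:8b"]
sample_installed = {"qwen2.5:14b", "nomic-embed-text:latest"}
print(missing_models(required, sample_installed))
```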

---

## Usage

```bash
# Run all baseline models
python3 main.py

# Single model (auto-detects thinking mode)
python3 main.py --model granite4.1:8b

# Variance analysis — 3 runs per model
python3 main.py --mode baseline --runs 3

# Auto-discover and test all models in ollama list
python3 main.py --test-all

# Reports
python3 main.py --report                  # latest run per model
python3 main.py --report --report-best    # best run per model

# Fast run (no thermal cooldown)
python3 main.py --no-cooldown
```

---

## Configuration

Edit `config.py`:

```python
MODELS_BASELINE_DIRECT = ["granite4.1:8b", "qwen2.5-coder:14b"]
MODELS_BASELINE_THINKING = ["nemotron-3-nano:4b", "gemma4:e4b"]
JUDGE_MODEL = "qwen2.5:14b"  # dedicated — never benchmarked
EMBED_MODEL = "nomic-embed-text"
```

---

## File Structure

```
benchmark_v4/
  config.py       models, weights, settings
  prompts.py      all prompts, ground truths, judge rubrics
  validators.py   Layer 1: deterministic scoring
  judge.py        Layer 2: LLM judge + embedding similarity
  scoring.py      combines all layers into final scores
  runner.py       executes models, orchestrates benchmark
  storage.py      SQLite read/write (benchmark_v4.db)
  reporting.py    terminal output
  main.py         CLI entry point
```

---

## Results Database

All results are stored in `benchmark_v4.db` (SQLite, never deleted).

```sql
-- Latest ranking
SELECT model, weighted_avg, stdev_all, failure_rate_pct
FROM runs
WHERE id IN (SELECT MAX(id) FROM runs GROUP BY model)
ORDER BY weighted_avg DESC;

-- Compliance rates
SELECT model, compliance_json, compliance_yaml,
       compliance_tool, compliance_hall
FROM runs
WHERE id IN (SELECT MAX(id) FROM runs GROUP BY model);

-- Detailed test scores
SELECT test, semantic_score, format_score, notes
FROM details
WHERE model = 'granite4.1:8b'
  AND run_id = (SELECT MAX(id) FROM runs WHERE model = 'granite4.1:8b');
```

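The same "latest ranking" query can be run from Python with the standard `sqlite3` module. The sketch below uses an in-memory database with made-up rows so that it runs anywhere; the table and column names come from the queries above, while the column types and the `latest_ranking` helper are assumptions. To query real results, connect to `benchmark_v4.db` instead of `:memory:` and skip the table creation.

```python
import sqlite3

LATEST_RANKING = """
SELECT model, weighted_avg, stdev_all, failure_rate_pct
FROM runs
WHERE id IN (SELECT MAX(id) FROM runs GROUP BY model)
ORDER BY weighted_avg DESC
"""


def latest_ranking(conn):
    """Return one row per model (its most recent run), best score first."""
    return conn.execute(LATEST_RANKING).fetchall()


# In-memory stand-in for benchmark_v4.db with hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE runs (id INTEGER PRIMARY KEY, model TEXT, "
    "weighted_avg REAL, stdev_all REAL, failure_rate_pct REAL)"
)
conn.executemany(
    "INSERT INTO runs (model, weighted_avg, stdev_all, failure_rate_pct) "
    "VALUES (?, ?, ?, ?)",
    [
        ("model-a", 5.0, 2.0, 12.5),  # older run of model-a, superseded below
        ("model-b", 6.5, 1.0, 0.0),
        ("model-a", 7.0, 1.5, 0.0),
    ],
)

for row in latest_ranking(conn):
    print(row)
```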

---

## Validated Stack (RTX 5060 Ti 16GB)

| Model | Role | w | σ | fail% |
|---|---|---|---|---|
| granite4.1:8b | Reliable default | 6.85 | 0.81 | 0% |
| qwen2.5-coder:14b | Coding / infra | 6.69 | 1.15 | 0% |
| nemotron-3-nano:4b | Fast chat | 6.37 | 2.87 | 6% |
| gemma4:e4b | RAG / research | 6.06 | 2.56 | 6% |

10 models tested. 6 rejected. Rankings stable across rebuilds.

## Example Output by Category

```
======================================================================
  CATEGORY BREAKDOWN (latest run per model)
======================================================================

  Model                  agent    code     rag  struct    hall  reason
  --------------------------------------------------------------------
★ gemma4:e4b               8.5    10.0     9.0    10.0    10.0     7.0
★ granite4.1:8b           10.0    10.0     9.0     7.5    10.0    7.67
  phi4:latest             10.0    10.0     9.0     6.5    10.0     7.0
★ nemotron-3-nano:4b       7.0     7.5     9.5    10.0    10.0    8.33
  lfm2:latest             10.0     7.5     9.0    10.0     4.0     8.0
★ qwen2.5-coder:14b       10.0    10.0     8.5     9.5     0.0    6.67
  mistral-nemo:12b         5.0    10.0     8.5     9.5     6.0     5.0
```