📊

Eval Pipeline — How to Measure Harness Quality

Gut feeling loses — automated evaluation systems are essential

"I changed the prompt and it feels better" — that judgment is dangerous. Better on 10 cases might mean worse on 10 others.

Eval Pipeline Structure

Test set — collection of input-expected output pairs. Extract from real use cases. Synthetic data alone isn't enough.

Runner — runs the test set through the harness. Must support A/B comparison under identical conditions.

Scorer — quantifies output quality. Auto-scoring for clear-cut answers (for code: test pass/fail); LLM-as-judge or human eval for natural language.

Tracking — record inputs, intermediate steps, outputs, and scores for every run. Must be able to trace what improved and what degraded.
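The four parts above fit together in a few lines. A minimal sketch; all names here (`run_eval`, `harness`, `scorer`) are illustrative assumptions, not a real API:

```python
# Minimal sketch of the four pipeline parts: test set, runner,
# scorer, tracking. Names are illustrative, not a real API.

def run_eval(test_set, harness, scorer):
    """Runner: push every case through the harness, score it, and
    track input, expected output, actual output, and score per case."""
    records = []
    for case in test_set:
        output = harness(case["input"])
        records.append({
            "input": case["input"],
            "expected": case["expected"],
            "output": output,
            "score": scorer(output, case["expected"]),
        })
    return records

# Toy test set, harness, and exact-match scorer for demonstration.
test_set = [
    {"input": "2+2", "expected": "4"},
    {"input": "3*3", "expected": "9"},
]
harness = lambda q: str(eval(q))               # stand-in for the real system
scorer = lambda out, exp: 1.0 if out == exp else 0.0

records = run_eval(test_set, harness, scorer)
mean_score = sum(r["score"] for r in records) / len(records)
print(f"mean score: {mean_score:.2f}")         # mean score: 1.00
```

Because every record keeps the input alongside the output and score, two runs of the same test set can be diffed case by case, which is what makes A/B comparison and failure analysis possible.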

Metric Design

For a code generation harness:

  • Test pass rate (pass@1, pass@5)

  • Patch accuracy (did it modify the right files?)

  • Unnecessary change ratio (did it touch unrelated files?)
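The pass@k numbers are usually computed with the unbiased estimator popularized by the HumanEval benchmark: with n samples per problem, of which c pass, pass@k = 1 − C(n−c, k)/C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n samples with c passing, passes."""
    if n - c < k:
        return 1.0  # too few failing samples to fill all k slots
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 10 samples generated, 3 of them pass the tests.
print(round(pass_at_k(10, 3, 1), 3))  # 0.3
print(round(pass_at_k(10, 3, 5), 3))  # 0.917
```

Averaging this value over all problems in the test set gives the headline pass@k score.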

For a RAG harness:

  • Retrieval precision (are fetched documents relevant?)

  • Answer faithfulness (is the answer grounded in search results?)

  • Hallucination rate
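Of the three RAG metrics, retrieval precision is the one that reduces to a mechanical calculation (faithfulness and hallucination rate typically need LLM-as-judge or human labels). A minimal sketch with hypothetical document IDs:

```python
def retrieval_precision(retrieved: list, relevant: set) -> float:
    """Fraction of retrieved documents that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for doc in retrieved if doc in relevant) / len(retrieved)

# Hypothetical retrieval result: 4 fetched docs, 2 of them relevant.
print(retrieval_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"}))  # 0.5
```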

Gut vs Data

"This prompt seems better" is gut feeling. "pass@1 went from 52% to 61%, and failure case analysis shows improved error message parsing" is data.

Harness engineering makes decisions with data, not gut feeling.

How It Works

1. Build test set — at least 50 input-expected output pairs drawn from real use cases

2. Auto-run + scoring — run the entire test set on every harness change and compare scores

3. Failure case analysis — analyze cases where scores dropped to find root causes

4. Regression prevention — verify existing cases don't break when adding new features
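Step 4 can be as simple as diffing per-case scores between a baseline run and a candidate run of the same test set. A minimal sketch with hypothetical case IDs:

```python
def find_regressions(baseline: dict, candidate: dict, tol: float = 0.0) -> list:
    """Case IDs where the candidate harness scores below the baseline
    (beyond an optional tolerance for noisy scorers)."""
    return [case_id for case_id, base_score in baseline.items()
            if candidate.get(case_id, 0.0) < base_score - tol]

# Hypothetical per-case scores from two runs of the same test set.
baseline  = {"case1": 1.0, "case2": 1.0, "case3": 0.0}
candidate = {"case1": 1.0, "case2": 0.0, "case3": 1.0}
print(find_regressions(baseline, candidate))  # ['case2']
```

Note that the candidate improved on case3 but regressed on case2, exactly the "better on 10 cases, worse on 10 others" trap the aggregate score alone would hide.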

Use Cases

SWE-bench — benchmark evaluating coding agent harnesses on real GitHub issues

RAG evaluation — measuring RAG harness quality via retrieval precision, faithfulness, and hallucination rate