Eval Pipeline: How to Measure Harness Quality
Gut feeling loses: automated evaluation systems are essential.
"I changed the prompt and it feels better" is a dangerous judgment. Better on 10 cases might mean worse on 10 others.
Eval Pipeline Structure
Test set: a collection of input-expected output pairs. Extract from real use cases; synthetic data alone isn't enough.
Runner: runs the test set through the harness. Must support A/B comparison under identical conditions.
Scorer: quantifies output quality. Auto-scoring for clear answers (code → test pass/fail), LLM-as-judge or human eval for natural language.
Tracking: record inputs, intermediate steps, outputs, and scores for every run. Must be able to trace what improved and what degraded.
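The four pieces above can be sketched in a few lines. This is a minimal illustration, not a full framework: the harness is assumed to be any callable from input string to output string, and all names (`TestCase`, `RunRecord`, `run_eval`) are hypothetical.

```python
# Minimal eval-pipeline sketch: test set + runner + scorer + tracking.
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestCase:           # one entry of the test set
    input: str
    expected: str

@dataclass
class RunRecord:          # tracking: one row per case execution
    case_input: str
    output: str
    score: float

def exact_match_scorer(output: str, expected: str) -> float:
    # Auto-scoring for clear answers; swap in an LLM judge for free text.
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_eval(harness: Callable[[str], str],
             test_set: list[TestCase],
             scorer: Callable[[str, str], float] = exact_match_scorer) -> list[RunRecord]:
    # Runner: same test set, same scorer, so two harness versions are
    # comparable under identical conditions (A/B).
    records = []
    for case in test_set:
        output = harness(case.input)  # intermediate steps could be logged here too
        records.append(RunRecord(case.input, output, scorer(output, case.expected)))
    return records

def mean_score(records: list[RunRecord]) -> float:
    return sum(r.score for r in records) / len(records)
```

To A/B two harness versions, call `run_eval` twice with the same `test_set` and compare the two `mean_score` values plus the per-case records.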
Metric Design
For code generation harness:
Test pass rate (pass@1, pass@5)
Patch accuracy (did it modify the right files?)
Unnecessary change ratio (did it touch unrelated files?)
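For the pass@k numbers above, the standard unbiased estimator is worth computing rather than eyeballing: with n samples per problem and c of them passing the tests, pass@k = 1 − C(n−c, k) / C(n, k). A small sketch (function names are illustrative):

```python
# Unbiased pass@k estimator: probability that at least one of k samples
# drawn from n (of which c are correct) passes the tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:           # fewer failures than k draws: at least one pass is certain
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    # results: (n_samples, n_correct) per problem in the test set
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Report pass@1 and pass@5 from the same sampled runs by calling `mean_pass_at_k` with k=1 and k=5.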
For RAG harness:
Retrieval precision (are fetched documents relevant?)
Answer faithfulness (is the answer grounded in search results?)
Hallucination rate
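Retrieval precision is the most mechanical of the three RAG metrics to automate. A sketch, assuming the test set annotates which document IDs are relevant per query (faithfulness and hallucination rate usually need an LLM judge on top):

```python
# Retrieval precision: fraction of fetched documents that are relevant.
# Document IDs and relevance labels are assumed to come from the test set.
def retrieval_precision(fetched: list[str], relevant: set[str]) -> float:
    if not fetched:
        return 0.0
    return sum(1 for doc_id in fetched if doc_id in relevant) / len(fetched)
```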
Gut vs Data
"This prompt seems better" is gut feeling. "pass@1 went from 52% to 61%, and failure case analysis shows improved error message parsing" is data.
Harness engineering makes decisions with data, not gut feeling.
How It Works
Build test set: at least 50 input-expected output pairs drawn from real use cases.
Auto-run + scoring: run the entire test set on every harness change and compare scores.
Failure case analysis: analyze cases where scores dropped to find root causes.
Regression prevention: verify existing cases don't break when adding new features.
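The last two steps reduce to a per-case diff between a baseline run and a candidate run. A sketch, assuming each run is summarized as a dict mapping case ID to score (higher is better; the function name and tolerance parameter are illustrative):

```python
# Per-case regression check between two eval runs.
def find_regressions(baseline: dict[str, float],
                     candidate: dict[str, float],
                     tolerance: float = 0.0) -> list[str]:
    # Cases where the candidate scores worse than the baseline: these are
    # the starting points for failure case analysis before shipping.
    return sorted(case for case in baseline
                  if candidate.get(case, 0.0) < baseline[case] - tolerance)
```

Gate merges on this list being empty (or on each regression being explicitly reviewed), so new features can't silently break existing cases.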