📊

Eval Pipeline — How to Measure Harness Quality

Gut feeling loses — automated evaluation systems are essential

"I changed the prompt and it feels better" — that judgment is dangerous. Better on 10 cases might mean worse on 10 others.

Eval Pipeline Structure

Test set — collection of input-expected output pairs. Extract from real use cases. Synthetic data alone isn't enough.

Runner — runs the test set through the harness. Must support A/B comparison under identical conditions.

Scorer — quantifies output quality. Auto-scoring for clear-cut answers (for code: test pass/fail); LLM-as-judge or human eval for natural language.

Tracking — record inputs, intermediate steps, outputs, and scores for every run. Must be able to trace what improved and what degraded.
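The four parts above fit together in a few lines. A minimal sketch; all names here (`run_eval`, `harness`, `scorer`) are illustrative assumptions, not a real API:

```python
# Minimal sketch of the four pipeline parts: test set, runner,
# scorer, tracking. Names are illustrative, not a real API.

def run_eval(test_set, harness, scorer):
    """Runner: push every case through the harness, score it, and
    track input, expected output, actual output, and score per case."""
    records = []
    for case in test_set:
        output = harness(case["input"])
        records.append({
            "input": case["input"],
            "expected": case["expected"],
            "output": output,
            "score": scorer(output, case["expected"]),
        })
    return records

# Toy test set, harness, and exact-match scorer for demonstration.
test_set = [
    {"input": "2+2", "expected": "4"},
    {"input": "3*3", "expected": "9"},
]
harness = lambda q: str(eval(q))               # stand-in for the real system
scorer = lambda out, exp: 1.0 if out == exp else 0.0

records = run_eval(test_set, harness, scorer)
mean_score = sum(r["score"] for r in records) / len(records)
print(f"mean score: {mean_score:.2f}")         # mean score: 1.00
```

Because every record keeps the input alongside the output and score, two runs of the same test set can be diffed case by case, which is what makes A/B comparison and failure analysis possible.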

Metric Design

For a code generation harness:

  • Test pass rate (pass@1, pass@5)

  • Patch accuracy (did it modify the right files?)

  • Unnecessary change ratio (did it touch unrelated files?)
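The pass@k numbers are usually computed with the unbiased estimator popularized by the HumanEval benchmark: with n samples per problem, of which c pass, pass@k = 1 − C(n−c, k)/C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n samples with c passing, passes."""
    if n - c < k:
        return 1.0  # too few failing samples to fill all k slots
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 10 samples generated, 3 of them pass the tests.
print(round(pass_at_k(10, 3, 1), 3))  # 0.3
print(round(pass_at_k(10, 3, 5), 3))  # 0.917
```

Averaging this value over all problems in the test set gives the headline pass@k score.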

For a RAG harness:

  • Retrieval precision (are fetched documents relevant?)

  • Answer faithfulness (is the answer grounded in search results?)

  • Hallucination rate
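Of the three RAG metrics, retrieval precision is the one that reduces to a mechanical calculation (faithfulness and hallucination rate typically need LLM-as-judge or human labels). A minimal sketch with hypothetical document IDs:

```python
def retrieval_precision(retrieved: list, relevant: set) -> float:
    """Fraction of retrieved documents that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for doc in retrieved if doc in relevant) / len(retrieved)

# Hypothetical retrieval result: 4 fetched docs, 2 of them relevant.
print(retrieval_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"}))  # 0.5
```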

Gut vs Data

"This prompt seems better" is gut feeling. "pass@1 went from 52% to 61%, and failure case analysis shows improved error message parsing" is data.

Harness engineering makes decisions with data, not gut feeling.

How It Works

1. Build test set — at least 50 input-expected output pairs drawn from real use cases

2. Auto-run + scoring — run the entire test set on every harness change and compare scores

3. Failure case analysis — analyze cases where scores dropped to find root causes

4. Regression prevention — verify existing cases don't break when adding new features
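Step 4 can be as simple as diffing per-case scores between a baseline run and a candidate run of the same test set. A minimal sketch with hypothetical case IDs:

```python
def find_regressions(baseline: dict, candidate: dict, tol: float = 0.0) -> list:
    """Case IDs where the candidate harness scores below the baseline
    (beyond an optional tolerance for noisy scorers)."""
    return [case_id for case_id, base_score in baseline.items()
            if candidate.get(case_id, 0.0) < base_score - tol]

# Hypothetical per-case scores from two runs of the same test set.
baseline  = {"case1": 1.0, "case2": 1.0, "case3": 0.0}
candidate = {"case1": 1.0, "case2": 0.0, "case3": 1.0}
print(find_regressions(baseline, candidate))  # ['case2']
```

Note that the candidate improved on case3 but regressed on case2, exactly the "better on 10 cases, worse on 10 others" trap the aggregate score alone would hide.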

Use Cases

SWE-bench — benchmark evaluating coding agent harnesses on real GitHub issues

RAG evaluation — measuring RAG harness quality via retrieval precision, faithfulness, and hallucination rate