Methodology

The grading architecture is layered and weighted toward determinism. LLM-as-judge is the last resort, not the spine.
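
As a rough illustration of what a layered, determinism-first grader could look like, the Python sketch below tries cheap deterministic checks first and falls through to a judge only when none of them can decide. The layer names, the `grade` function, and the stubbed judge are all hypothetical; this is a sketch of the idea, not the actual grading pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A layer returns a score in [0, 1], or None when it cannot decide.
Layer = Callable[[str, str], Optional[float]]


@dataclass
class Verdict:
    score: float
    layer: str  # which layer produced the score


def exact_match(output: str, expected: str) -> Optional[float]:
    """Deterministic: decides only when the stripped strings are identical."""
    return 1.0 if output.strip() == expected.strip() else None


def numeric_match(output: str, expected: str) -> Optional[float]:
    """Deterministic: decides whenever both sides parse as numbers."""
    try:
        return 1.0 if abs(float(output) - float(expected)) < 1e-9 else 0.0
    except ValueError:
        return None


def stub_llm_judge(output: str, expected: str) -> Optional[float]:
    """Stand-in for a model-based judge (assumption: returns a fixed 0.5)."""
    return 0.5


# Hypothetical ordering: deterministic layers first, judge last.
LAYERS: List[Tuple[str, Layer]] = [
    ("exact", exact_match),
    ("numeric", numeric_match),
    ("llm_judge", stub_llm_judge),
]


def grade(output: str, expected: str, layers=LAYERS) -> Verdict:
    """Return the verdict of the first layer that can decide."""
    for name, layer in layers:
        score = layer(output, expected)
        if score is not None:
            return Verdict(score, name)
    raise ValueError("no layer produced a verdict")
```

Recording which layer produced each verdict makes it possible to report how much of a run was graded deterministically and how much fell through to the judge.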

Eight problems in current evaluation

These are the structural failure modes we designed around.

Single-judge evaluation

One LLM grades another. The bias is correlated: judge and judged share training-data artifacts. The resulting score conflates judge capability with subject capability.

Ceiling-saturated scales

Benchmarks on which frontier models score near-perfectly cannot measure further improvement. Once a benchmark saturates, which can happen within months, its score no longer separates one model from another.

Harness conflation

The same model can swing 20+ points depending on scaffolding. Most benchmarks don't separate model capability from harness capability.
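
One way to keep the two apart, sketched here under assumed data, is to score every model under every harness and report the spread across harnesses next to the mean. The model names, harness names, and scores below are invented for illustration and are not measurements.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical scores: (model, harness) -> benchmark score.
SCORES = {
    ("model_a", "bare_prompt"): 41.0,
    ("model_a", "agent_scaffold"): 63.0,
    ("model_b", "bare_prompt"): 48.0,
    ("model_b", "agent_scaffold"): 59.0,
}


def harness_spread(scores):
    """Per model: mean score and max-minus-min spread across harnesses."""
    by_model = defaultdict(list)
    for (model, _harness), score in scores.items():
        by_model[model].append(score)
    return {
        model: {"mean": mean(values), "spread": max(values) - min(values)}
        for model, values in by_model.items()
    }


summary = harness_spread(SCORES)
```

A large spread for a single model indicates that its headline number reflects the scaffolding as much as the model.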

Output-only evaluation

Two models can reach the same correct answer through very different reasoning trajectories: one efficient, one bluffing. Output-only evaluation cannot tell them apart.

Domain conflation

A model excellent at general retrieval may be poor at clinical retrieval. If the eval doesn't decompose by domain, that signal is invisible.
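
The effect is easy to see in a small sketch. Assuming each graded item carries a domain label (the labels and results here are hypothetical), the aggregate accuracy looks moderate while the per-domain breakdown shows where the failures sit.

```python
from collections import defaultdict

# Hypothetical graded items: (domain, answered_correctly).
RESULTS = [
    ("general", True), ("general", True), ("general", True), ("general", True),
    ("clinical", True), ("clinical", False), ("clinical", False), ("clinical", False),
]


def overall_accuracy(results):
    """Single aggregate number, ignoring domain."""
    return sum(correct for _domain, correct in results) / len(results)


def accuracy_by_domain(results):
    """Accuracy computed separately for each domain label."""
    totals = defaultdict(lambda: [0, 0])  # domain -> [correct, seen]
    for domain, correct in results:
        totals[domain][0] += int(correct)
        totals[domain][1] += 1
    return {domain: c / n for domain, (c, n) in totals.items()}


aggregate = overall_accuracy(RESULTS)
per_domain = accuracy_by_domain(RESULTS)
```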

Unprincipled rubrics

"Rate this 1–10" with no anchor definitions produces calibration drift across runs, judges, and time.

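A minimal sketch of the alternative, assuming a small integer scale on which every level carries a written anchor. The `Rubric` class and the anchor wording are hypothetical examples, not a published rubric.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Rubric:
    name: str
    anchors: Dict[int, str]  # score level -> definition of that level

    def validate(self, score: int) -> str:
        """Return the anchor text for a score, rejecting undefined levels."""
        if score not in self.anchors:
            raise ValueError(f"{score} has no anchor in rubric {self.name!r}")
        return self.anchors[score]


# Hypothetical rubric: three levels, each tied to an observable condition.
citation_rubric = Rubric(
    name="citation_support",
    anchors={
        0: "No cited source supports the claim.",
        1: "A cited source supports part of the claim.",
        2: "A cited source supports the full claim.",
    },
)
```

Because every permitted score maps to a written condition, a grade outside the defined levels is rejected instead of being silently averaged in.
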
Prose-based verification

Most evals verify prose with another LLM or fuzzy similarity. Prose surfaces are inherently fuzzy; verification inherits the fuzziness.
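
The sketch below illustrates the problem with the standard-library `difflib.SequenceMatcher`: two sentences that differ by a single digit score as nearly identical even though one of them is wrong. It then shows one possible alternative, assuming the task can be made to emit structured fields; the field names and the example sentences are hypothetical.

```python
import json
from difflib import SequenceMatcher


def fuzzy_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def fields_match(output_json: str, expected: dict) -> bool:
    """Exact comparison of parsed fields; malformed JSON counts as a failure."""
    try:
        parsed = json.loads(output_json)
    except json.JSONDecodeError:
        return False
    return parsed == expected


# Hypothetical example: the second sentence is off by a factor of ten.
reference = "The recommended dose is 50 mg twice daily."
wrong = "The recommended dose is 500 mg twice daily."
similarity = fuzzy_similarity(reference, wrong)

expected_fields = {"dose_mg": 50, "frequency_per_day": 2}
wrong_output = '{"dose_mg": 500, "frequency_per_day": 2}'
right_output = '{"frequency_per_day": 2, "dose_mg": 50}'
```

Here `similarity` is close to 1 for the wrong sentence, while `fields_match(wrong_output, expected_fields)` is false and `fields_match(right_output, expected_fields)` is true regardless of key order.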

Untethered retrieval

Models that confabulate plausible content about no specific object pass the same checks as models that retrieve the correct object. Without an identifier anchor, the two cases cannot be told apart.
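
A minimal sketch of an identifier-anchored check, assuming the model is asked to return the identifier of the object it retrieved along with the content. The catalog, the identifiers, and the field names are hypothetical.

```python
# Hypothetical ground-truth catalog keyed by stable identifier.
CATALOG = {
    "DOC-001": {"title": "Aspirin dosing guideline"},
    "DOC-002": {"title": "Warfarin interaction table"},
}


def check_anchored(response: dict, catalog: dict) -> str:
    """Classify a response by whether it is tied to a real catalog entry."""
    ident = response.get("id")
    if ident is None:
        return "untethered"   # content with no identifier at all
    record = catalog.get(ident)
    if record is None:
        return "unknown_id"   # identifier that does not exist
    if response.get("title") != record["title"]:
        return "mismatch"     # real identifier, wrong content
    return "match"
```

A plausible-sounding title with no identifier is classified as `untethered` instead of passing, which is the disambiguation an unanchored check cannot make.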