Methodology
The grading architecture is layered and weighted toward determinism. LLM-as-judge is the last resort, not the spine.
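A minimal sketch of what "layered and weighted toward determinism" could look like in code. Every name here (Verdict, exact_match, numeric_match, grade) is hypothetical and chosen for illustration; the source does not specify the actual layers. The assumption is that each deterministic layer either decides or defers, and the judge is called only when every layer defers.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    passed: bool
    layer: str  # name of the layer that produced the verdict


# A layer returns True/False when it can decide, or None to defer.
Layer = Callable[[str, str], Optional[bool]]


def exact_match(output: str, expected: str) -> Optional[bool]:
    # A string match is conclusive; a mismatch is not, so defer.
    return True if output.strip() == expected.strip() else None


def numeric_match(output: str, expected: str) -> Optional[bool]:
    # Decides only when both sides parse as numbers.
    try:
        return abs(float(output) - float(expected)) < 1e-9
    except ValueError:
        return None


def grade(
    output: str,
    expected: str,
    layers: List[Tuple[str, Layer]],
    judge: Callable[[str, str], bool],
) -> Verdict:
    # Try deterministic layers in order; the first one that decides wins.
    for name, layer in layers:
        result = layer(output, expected)
        if result is not None:
            return Verdict(result, name)
    # Last resort: the (hypothetical) LLM judge callable.
    return Verdict(judge(output, expected), "llm_judge")
```

Recording which layer produced each verdict makes it possible to report what fraction of grades ever reached the judge.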
Eight problems in current evaluation
These are the structural failure modes we designed around.
Single-judge evaluation
One LLM grades another. The bias is correlated: judge and judged share training-data artifacts. The resulting score also conflates judge capability with subject capability.
Ceiling-saturated scales
Benchmarks where frontier models score near-perfectly cannot measure further improvement. The score becomes meaningless within months.
Harness conflation
The same model can swing 20+ points depending on scaffolding. Most benchmarks don't separate model capability from harness capability.
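One way to make the harness effect visible is to report, per model, the spread of scores across harnesses. The sketch below assumes results arrive as (model, harness, score) tuples; the function name and record shape are hypothetical, and any numbers passed in are the caller's own.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def harness_spread(results: Iterable[Tuple[str, str, float]]) -> Dict[str, float]:
    """Return max-minus-min score across harnesses for each model."""
    by_model = defaultdict(dict)
    for model, harness, score in results:
        by_model[model][harness] = score
    # A large spread means the score reflects scaffolding as much as the model.
    return {m: max(s.values()) - min(s.values()) for m, s in by_model.items()}
```

A model whose spread rivals the gap between models on the leaderboard is being ranked largely by its harness.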
Output-only evaluation
Two models can reach the same correct answer through wildly different reasoning trajectories: one efficient, the other bluffing its way there. Output-only evaluation cannot distinguish them.
Domain conflation
A model excellent at general retrieval may be poor at clinical retrieval. If the eval doesn't decompose by domain, that signal is invisible.
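Decomposing by domain is a small change to the scoring code. A sketch, assuming each graded item carries a domain label (the function name and the (domain, is_correct) record shape are illustrative assumptions):

```python
from typing import Dict, Iterable, Tuple


def per_domain_accuracy(items: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """items: (domain, is_correct) pairs. Returns accuracy per domain."""
    totals: Dict[str, Tuple[int, int]] = {}
    for domain, ok in items:
        n, k = totals.get(domain, (0, 0))
        totals[domain] = (n + 1, k + int(ok))
    return {d: k / n for d, (n, k) in totals.items()}
```

With ten general items at 9 correct and five clinical items at 2 correct, the aggregate is 11/15, which hides a clinical accuracy of 0.4 behind a general accuracy of 0.9.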
Unprincipled rubrics
"Rate this 1–10" with no anchor definitions produces calibration drift across runs, judges, and time.
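By contrast, an anchored rubric attaches a written definition to every score a grader may assign. The sketch below is one hypothetical shape for that; the class name, the three-point scale, and the anchor wording are all assumptions made for illustration, not the rubric used here.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class AnchoredRubric:
    name: str
    anchors: Dict[int, str]  # score -> written definition of that score

    def validate(self, score: int) -> int:
        # Reject any score that has no anchor definition.
        if score not in self.anchors:
            raise ValueError(f"{self.name}: score {score} has no anchor")
        return score


# Hypothetical example rubric; the wording is illustrative only.
citation_rubric = AnchoredRubric(
    name="citation_accuracy",
    anchors={
        0: "No citation, or the cited source does not exist.",
        1: "Cited source exists but does not support the claim.",
        2: "Cited source exists and directly supports the claim.",
    },
)
```

Keeping the scale short and every point defined gives different judges, and the same judge at different times, the same target to aim at.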
Prose-based verification
Most evals verify prose with another LLM or fuzzy similarity. Prose surfaces are inherently fuzzy; verification inherits the fuzziness.
Untethered retrieval
A model that confabulates plausible content tied to no specific object passes the same checks as one that retrieves correctly. Without an identifier anchor, there is no way to tell the two apart.
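An identifier anchor can be checked deterministically. The sketch below assumes the model is required to name a record ID alongside its answer and that a ground-truth corpus keyed by ID is available; the function name, the corpus shape, and the field names are hypothetical.

```python
from typing import Any, Dict


def is_anchored(
    claimed_id: str,
    claimed_fields: Dict[str, Any],
    corpus: Dict[str, Dict[str, Any]],
) -> bool:
    """True only if the ID exists and every claimed field matches the record."""
    record = corpus.get(claimed_id)
    if record is None:
        return False  # plausible content about no specific object
    # With no claimed fields, this reduces to an existence check on the ID.
    return all(record.get(k) == v for k, v in claimed_fields.items())
```

A fluent answer that cites a nonexistent ID, or a real ID with the wrong attributes, fails here regardless of how plausible the prose is.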