Methodology

The grading architecture is layered and weighted toward determinism. LLM-as-judge is the last resort, not the spine.
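As a rough illustration of "layered and weighted toward determinism", the sketch below runs deterministic checks in order and calls a judge only when none of them can decide. The layer functions, the `grade` helper, and the `judge` callable are hypothetical names introduced for this example, not the actual implementation.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical layer type: True/False when the layer can decide, None when it cannot.
Check = Callable[[str, str], Optional[bool]]

def exact_match(output: str, expected: str) -> Optional[bool]:
    # Deterministic: decides only on an exact hit after trimming whitespace.
    return True if output.strip() == expected.strip() else None

def numeric_match(output: str, expected: str) -> Optional[bool]:
    # Deterministic: decides whenever both sides parse as numbers.
    try:
        return abs(float(output) - float(expected)) < 1e-9
    except ValueError:
        return None

def grade(output: str, expected: str, layers: Sequence[Check],
          judge: Callable[[str, str], bool]) -> Tuple[bool, str]:
    # Try each deterministic layer in order; the first one that decides wins.
    for layer in layers:
        verdict = layer(output, expected)
        if verdict is not None:
            return verdict, layer.__name__
    # Last resort: defer to the (assumed) LLM judge.
    return judge(output, expected), "llm_judge"
```

Returning the name of the deciding layer alongside the verdict makes it possible to report what fraction of grades came from deterministic checks rather than from the judge.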

Eight problems in current evaluation

These are the structural failure modes we designed around.

1

Single-judge evaluation

One LLM grades another. The judge and the judged share training-data artifacts, so their errors are correlated, and the resulting score conflates judge capability with subject capability.

2

Ceiling-saturated scales

Benchmarks where frontier models score near-perfectly cannot measure further improvement. The score becomes meaningless within months.

3

Harness conflation

The same model can swing 20+ points depending on scaffolding. Most benchmarks don't separate model capability from harness capability.
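One way to make the harness effect visible is to score every model under every harness and report the spread next to the mean. The sketch below assumes scores are keyed by a `(model, harness)` pair; the `harness_spread` name and the data layout are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Tuple

def harness_spread(scores: Dict[Tuple[str, str], float]) -> Dict[str, Tuple[float, float]]:
    """Map each model to (mean score, max - min score) across its harnesses."""
    by_model = defaultdict(list)
    for (model, _harness), score in scores.items():
        by_model[model].append(score)
    return {
        model: (sum(vals) / len(vals), max(vals) - min(vals))
        for model, vals in by_model.items()
    }
```

A large spread for a single model suggests the headline number reflects scaffolding at least as much as the model itself.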

4

Output-only evaluation

Two models can reach the same correct answer through wildly different reasoning trajectories — one efficient, one bluffing. Output-only evaluation can't distinguish them.

5

Domain conflation

A model excellent at general retrieval may be poor at clinical retrieval. If the eval doesn't decompose by domain, that signal is invisible.
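A minimal sketch of decomposing by domain, assuming each graded item carries a domain label. The `accuracy_by_domain` helper and the `(domain, is_correct)` record shape are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def accuracy_by_domain(results: Iterable[Tuple[str, bool]]) -> Tuple[float, Dict[str, float]]:
    """Return (overall accuracy, {domain: accuracy}) from (domain, is_correct) records."""
    totals = defaultdict(lambda: [0, 0])  # domain -> [correct, seen]
    for domain, ok in results:
        totals[domain][0] += int(ok)
        totals[domain][1] += 1
    per_domain = {d: c / n for d, (c, n) in totals.items()}
    correct = sum(c for c, _ in totals.values())
    seen = sum(n for _, n in totals.values())
    overall = correct / seen if seen else 0.0
    return overall, per_domain
```

With eight correct general items and one correct clinical item out of two, the overall figure is 0.9 while the clinical figure is 0.5, which is exactly the signal an aggregate score hides.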

6

Unprincipled rubrics

"Rate this 1–10" with no anchor definitions produces calibration drift across runs, judges, and time.

7

Prose-based verification

Most evals verify prose with another LLM or fuzzy similarity. Prose surfaces are inherently fuzzy; verification inherits the fuzziness.

8

Untethered retrieval

Models that confabulate plausible content tied to no specific object pass the same checks as models that retrieve correctly. Without an identifier anchor, the two cases cannot be told apart.
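An identifier anchor can be checked deterministically. The sketch below assumes responses cite identifiers in a made-up `PREFIX-digits` format and that a catalogue of valid identifiers is available; `ID_PATTERN`, `is_anchored`, and the catalogue are all hypothetical.

```python
import re
from typing import Set

# Hypothetical identifier format, e.g. "DOC-101"; a real scheme would differ.
ID_PATTERN = re.compile(r"\b[A-Z]+-\d+\b")

def is_anchored(response: str, expected_id: str, catalogue: Set[str]) -> bool:
    """True only if the response cites the expected identifier and every
    identifier it cites exists in the catalogue."""
    cited = set(ID_PATTERN.findall(response))
    return expected_id in cited and cited <= catalogue
```

A response with no identifier, or one citing an identifier that does not exist, fails regardless of how plausible its prose is.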