AI Has Outgrown Its Own Exams

The scoreboard that ranked machines for a decade is going blind. Harder tests will not fix it, because the problem is the fixed test itself.

A leaderboard is a comfort. It turns a dizzying question, "Is this machine intelligent?", into a number you can compare. For ten years, artificial intelligence lived under the discipline of the score. Today the best models clear 88 percent on the major knowledge exams, and they are packed so tightly that the gap between first place and tenth means nothing. The needle no longer moves; it is pinned to the top.

Saturation is not a technical footnote. MMLU, long the reference exam, was saturated in 2024, and others were saturated before it. Every test built to last for years now gives way in months. And the more famous an exam becomes, the more its questions seep into training data. The model stops reasoning and starts remembering. The score climbs for the wrong reason.

The Law That Eats the Measure

It is an old rule, known as Goodhart's law after the economist Charles Goodhart: once a measure becomes a target, it stops being a good measure. AI gives the cleanest demonstration yet. Labs optimize for the scoreboard, the scoreboard drifts from reality, and the gap widens. On enterprise tasks, the distance between lab-reported performance and what shows up in production reaches 37 percent.

The industry's reflex is to build harder exams, sealed against contamination. They will hold for a year, maybe two. The trouble is not the difficulty of the questions; it is the very idea of a fixed exam. An intelligence worthy of the name should, by definition, overflow the box we try to lock it in.

Measuring Trust, Not Knowledge

The real shift lies elsewhere. The useful question is no longer "How much does it know?" but "How far can I trust it, here, on my task, at what cost?" For the same accuracy, costs vary by a factor of fifty. Factual error rates range from 22 to 94 percent across models. No single number captures that.
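
To see why one grade cannot carry all of this, consider a minimal sketch in Python. Everything in it is an assumption made for illustration: the model names, the task counts, and the dollar figures are invented, chosen only to mirror the factor-of-fifty spread described above. It compares two hypothetical models that post the same accuracy on your own task set and asks what each usable answer actually costs.

```python
from dataclasses import dataclass


@dataclass
class ModelRun:
    """Results of one model on your own task set (hypothetical fields)."""
    name: str
    correct: int      # answers judged correct on the task set
    total: int        # tasks attempted
    cost_usd: float   # total spend for the run

    @property
    def accuracy(self) -> float:
        return self.correct / self.total

    @property
    def cost_per_correct(self) -> float:
        # Dollars spent for each answer you can actually use.
        return self.cost_usd / self.correct


# Invented numbers: identical accuracy, very different bills.
runs = [
    ModelRun("model_a", correct=880, total=1000, cost_usd=2.00),
    ModelRun("model_b", correct=880, total=1000, cost_usd=100.00),
]

for run in runs:
    print(f"{run.name}: accuracy={run.accuracy:.0%}, "
          f"cost per correct answer=${run.cost_per_correct:.4f}")

# Same score on the leaderboard, fifty times the price in the field.
ratio = runs[1].cost_per_correct / runs[0].cost_per_correct
print(f"cost ratio at equal accuracy: {ratio:.0f}x")
```

A ranking by accuracy would place these two side by side. Only the second column, measured on your task and your budget, tells them apart.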

So we enter an age without a shared compass, exactly when investment and deployment decisions are crying out for one. We will have to relearn how to judge without a grade, through the test of the field rather than the ranking. It is uncomfortable, and more honest: you only measure a machine well by watching it work.