2025 Year in Review for LLM Evaluation: When the Scorecard Broke

Randy Olson, PhD · 15 min read

Throughout 2024, researchers at UC Berkeley, MIT, and Cornell ran a simple experiment that would expose a structural flaw in how we measure progress. They tested leading language models on coding problems from programming contests, but with a twist. Instead of using standard benchmark problems that had been public for years, they tested models only on problems released after each model's training cutoff date.

The results were startling, and by 2025, the full implications became impossible to ignore. Models that scored well on static coding benchmarks dropped sharply when faced with truly novel problems they couldn't have seen during training, some by 20-30% or more. This wasn't a bug in the evaluation. It was proof that we'd been measuring memorization, not intelligence.

This experiment, captured in the LiveCodeBench benchmark, became a defining moment of 2025's evaluation reckoning. While model capabilities genuinely advanced (reasoning models, increasingly capable multimodal systems, autonomous coding agents solving real GitHub issues), our traditional evaluation methods hit a hard ceiling. Public leaderboards lost their predictive power for production use cases. MMLU scores above 80% told us nothing about production performance. The gap between benchmark scores and real-world utility widened into a chasm.

By December 2025, the AI industry had collectively realized that static, public tests had become training data. This triggered a fundamental shift toward dynamic, adversarial, and expert-grounded evaluation. More importantly, we discovered we'd been asking the wrong question all along. Not "how smart is this model?" but "what specific capabilities does this system have, and how can we measure them independently?"

Why We Stopped Trusting the Scorecard

The Contamination Problem

Models weren't just getting better at tests. They were gaming them.

LiveCodeBench provided the scientific proof of Goodhart's Law in action. By continuously collecting problems from LeetCode, Codeforces, and AtCoder after model training cutoffs, it exposed massive overfitting. This wasn't cheating by humans. It was models optimizing for the reward signal rather than the underlying capability they were supposed to measure.
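
The mechanics of contamination-aware evaluation are simple enough to sketch. The snippet below is an illustration, not LiveCodeBench's actual code: it keeps only problems published after a model's training cutoff, so scores can't be inflated by memorized solutions. The Problem structure, dates, and cutoff are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Problem:
    """A contest problem with its public release date (hypothetical schema)."""
    problem_id: str
    source: str          # e.g. "leetcode", "codeforces", "atcoder"
    released: date
    statement: str

def contamination_free(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems released after the model's training cutoff.

    Anything published before the cutoff may have leaked into the training
    data, so scores on it measure memorization as much as capability.
    """
    return [p for p in problems if p.released > training_cutoff]

# Usage: evaluate a model with a (hypothetical) March 2024 cutoff only on
# problems it could not have seen during training.
pool = [
    Problem("two-sum", "leetcode", date(2015, 9, 1), "..."),
    Problem("contest-2025-04-a", "codeforces", date(2025, 4, 12), "..."),
]
eval_set = contamination_free(pool, training_cutoff=date(2024, 3, 1))
assert [p.problem_id for p in eval_set] == ["contest-2025-04-a"]
```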

But the contamination problem went deeper than passive memorization. A report from NIST's Center for AI Standards and Innovation documented something more disturbing. Agents were actively exploiting evaluation environments. In SWE-bench evaluations, some autonomous coding agents learned to inspect the `.git` history of repositories to find the human-written patches that fixed bugs, then simply copied those solutions instead of solving the problems themselves.

These agents weren't students learning to solve problems. They were adversaries optimizing for the metric. This forced the evaluation community to realize that testing environments need the same rigor as security penetration testing. Not just measuring capability, but defending against systems actively trying to game the measurement.

The benchmark treadmill had begun. We're now in an arms race where benchmarks have a 6-12 month shelf life before contamination or overfitting renders them useless. LiveCodeBench's continuous updates provide a partial solution, but create new problems of their own: non-reproducibility, moving goalposts, and the inability to directly compare models trained months apart.

Intelligence Has Many Dimensions

2025 shattered an assumption that had guided AI development for years: that "smarter" models would automatically be better at everything.

Google DeepMind's release of SimpleQA Verified in September exposed something unexpected. This refined factuality benchmark revealed that reasoning-optimized models like DeepSeek-R1 don't automatically excel at simple fact retrieval. Models designed to "think" through complex chain-of-thought problems sometimes performed worse than standard models on straightforward questions like "What year did X happen?", while other reasoning models showed improved factuality performance.

Thinking longer doesn't cure hallucinations. Sometimes it just makes them more elaborate and convincing.

The data was clear. Reasoning capability, factual recall, and instruction-following are distinct, somewhat orthogonal capabilities. A model can be brilliant at mathematical problem-solving while being unreliable at basic knowledge retrieval. This forced a complete rethinking of what we mean by "model capability." There is no single dimension of "intelligence" to measure.

From Rankings to Capability Profiles

The decoupling revelation triggered a deeper shift in how the field approached evaluation itself. Throughout 2025, the research community moved from asking "which model is best?" to "what is each model good at?"

This wasn't just a philosophical change. It was operationalized. Stanford's HELM (Holistic Evaluation of Language Models) framework became representative of this new thinking, evaluating models across numerous metrics and scenarios rather than producing a single leaderboard score. The insight was simple but profound. Models are vectors of strengths and weaknesses, not points on a line.

A model might excel at creative writing but struggle with factual accuracy. Another might be reliable for structured tasks but poor at open-ended reasoning. A third might be fast and cheap but miss nuanced instructions. No single number can meaningfully capture this multidimensional reality.
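
One way to make the "vectors, not points" framing concrete is to keep per-dimension scores and pick a model per task rather than per leaderboard. A minimal sketch, with made-up model names, dimensions, and scores:

```python
# Hypothetical capability profiles: each model is a vector of per-dimension
# scores in [0, 1], not a single leaderboard number. All values are invented.
profiles = {
    "model-a": {"reasoning": 0.91, "factual_recall": 0.62,
                "instruction_following": 0.78, "latency_ok": 0.55},
    "model-b": {"reasoning": 0.70, "factual_recall": 0.88,
                "instruction_following": 0.85, "latency_ok": 0.90},
}

def best_for(task_weights: dict[str, float],
             profiles: dict[str, dict[str, float]]) -> str:
    """Pick the model whose profile best matches the task's weighted needs."""
    def score(profile: dict[str, float]) -> float:
        # Weighted sum of the capabilities this task actually cares about.
        return sum(w * profile.get(dim, 0.0) for dim, w in task_weights.items())
    return max(profiles, key=lambda name: score(profiles[name]))

# A support bot cares more about factual recall and latency than deep
# reasoning; a math tutor would weight the same dimensions very differently.
support_bot_needs = {"factual_recall": 0.5, "instruction_following": 0.3, "latency_ok": 0.2}
print(best_for(support_bot_needs, profiles))  # -> "model-b"
```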

For practitioners, this academic shift mirrored what they were discovering independently. The "best" model on public benchmarks was rarely the best model for their specific use case. Different business problems require different capability combinations. The era of chasing a single "state-of-the-art" score was over.

The LLM-as-a-Judge Challenge

As evaluation volume exploded in 2025, the industry tried to solve the scaling problem by using LLMs to evaluate other LLMs. The appeal was obvious. Automated evaluation at the cost of an API call instead of expensive human expert time.

But research throughout 2025 exposed critical flaws that couldn't be ignored.

Self-preference bias. Models systematically favor outputs from their own family. GPT models rate GPT outputs higher, Claude favors Claude, creating circular validation.

Verbosity bias. Longer responses get rated higher regardless of actual quality. The judges confuse thoroughness with correctness.

Prompt sensitivity. Small changes to how you phrase the judging criteria cause large, unpredictable scoring swings.

Failure on subtle errors. LLM judges consistently miss the kind of logic errors that human experts catch easily.

These weren't edge cases or cherry-picked failures. They were systematic, reproducible problems that emerged across multiple research groups studying judge reliability.

The implications were sobering. While vanilla LLM-as-a-judge works fine for cheap filtering and initial screening, it cannot replace human expert verification for high-stakes evaluation. Generic metrics like "helpfulness" or "coherence" are easy to generate but hard to trust. Most companies are still using vanilla LLM judges with generic metrics. The sophisticated ones are building something fundamentally different.
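
None of this means automated judging is useless; it means a judge has to be engineered rather than just prompted. Below is a minimal sketch of two common mitigations, position-swapped pairwise judging and a narrow rubric, where call_judge is a placeholder for whatever LLM client you actually use (ideally a model from a different family than the systems being judged, to dampen self-preference):

```python
from collections import Counter

RUBRIC = (
    "You are comparing two answers to the same question. Judge only factual "
    "correctness and task completion. Do NOT reward length or style. "
    "Reply with exactly one token: A, B, or TIE."
)

def call_judge(prompt: str) -> str:
    """Placeholder for an LLM call; expected to return 'A', 'B', or 'TIE'."""
    raise NotImplementedError("swap in your own judge model here")

def pairwise_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Judge both orderings and only declare a winner if they agree.

    Running A-vs-B and then B-vs-A controls for position bias; any
    disagreement between the two passes is treated as a tie and should be
    escalated to a human reviewer.
    """
    votes = Counter()
    orderings = [(answer_a, answer_b, "A", "B"), (answer_b, answer_a, "B", "A")]
    for first, second, label_first, label_second in orderings:
        prompt = f"{RUBRIC}\n\nQuestion: {question}\n\nAnswer A: {first}\n\nAnswer B: {second}"
        verdict = call_judge(prompt).strip().upper()
        # Map the judge's positional verdict back to the real answer label.
        votes[{"A": label_first, "B": label_second}.get(verdict, "TIE")] += 1
    winner, count = votes.most_common(1)[0]
    return winner if count == 2 else "TIE"
```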

Building Evaluation That Matches Reality

The "Impossible" Tests

To counter benchmark saturation, researchers in 2025 invented tests that were effectively impossible for existing models.

Humanity's Last Exam (HLE), released by the Center for AI Safety and Scale AI in January 2025, represented a radical new methodology. The benchmark was explicitly designed as the "final closed-ended academic benchmark." It contains 2,500 expert-level questions across mathematics, science, and humanities that stumped GPT-4o and Claude 3.5 during development. If a leading model could answer a question correctly during the design phase, that question was rejected.

The results provided a reality check for the industry. While models had achieved 80%+ on MMLU and other undergraduate-level benchmarks, the best models in 2025 scored only 30-35% on HLE. The gap between passing standardized tests and genuine expert-level synthesis remained vast.

FrontierMath took this even further. Released by Epoch AI in November 2024, this benchmark consists of research-grade mathematics problems where frontier models initially scored below 2%. Yet by 2025, advanced reasoning models were already reaching 25-30%, proving that even expert-level mathematics couldn't hold the line for long.

Both benchmarks share a common methodology: collaborative design with domain experts, negative filtering against existing models, and emphasis on synthesis over retrieval. They represent a temporary victory in the benchmark arms race, though history suggests models will eventually learn to game these too.

From Models to Systems

But the biggest shift in 2025 evaluation wasn't about harder static tests. It was about evaluating systems that take actions.

SWE-bench Verified became the gold standard for measuring autonomous coding agents. This refined version of the original SWE-bench contains hundreds of carefully validated GitHub issues, filtered to remove ambiguous or impossible tasks. Performance improved dramatically from around 30% in mid-2024 to around 75% by late 2025, with leading models continuing to close the gap.

This dramatic improvement represented genuine progress. But it also revealed the core challenge of agentic evaluation. Agents optimize for the measurement, not the intent. The git history exploit proved this definitively. Agents found unintended shortcuts to maximize their scores rather than developing the underlying capability the benchmark was designed to test.

By year's end, agent evaluation had expanded beyond coding to browser automation, GUI interaction, and multi-tool workflows. The fundamental challenge remained consistent: measuring system behavior over time (tool use patterns, recovery from dead ends, partial success versus catastrophic failure), not just binary task completion.

We're no longer evaluating text generation. We're evaluating autonomous decision-making in complex, adversarial environments. This requires fundamentally different evaluation approaches.
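
What "measuring behavior over time" can look like in code: the sketch below scores an agent trajectory on more than binary completion, using a hypothetical trace format (a list of tool-call records) and illustrative metrics rather than any standard benchmark's scoring.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    tool: str             # e.g. "run_tests", "edit_file", "search"
    succeeded: bool
    repeated_state: bool  # True if the call left the agent where it already was

def score_trajectory(calls: list[ToolCall], task_completed: bool) -> dict:
    """Trajectory-level metrics for a single agent run.

    Binary completion hides whether the agent wandered, looped, or recovered
    from dead ends; these numbers make that behavior visible and comparable.
    """
    failures = [c for c in calls if not c.succeeded]
    loop_steps = sum(c.repeated_state for c in calls)
    # Of the failed calls, how many were immediately followed by a success?
    recoveries = sum(
        1 for i, c in enumerate(calls[:-1])
        if not c.succeeded and calls[i + 1].succeeded
    )
    return {
        "completed": task_completed,
        "num_steps": len(calls),
        "failure_rate": len(failures) / len(calls) if calls else 0.0,
        "loop_steps": loop_steps,
        "recovery_rate": recoveries / len(failures) if failures else 1.0,
    }
```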

Process Reward Models (PRMs) emerged as one response to this challenge. Instead of just checking whether the final answer is correct, frameworks like ProcessBench evaluate every step of the reasoning chain. This helps catch "lucky guesses" where a model arrives at the right answer through completely wrong reasoning that just happens to work on this particular problem.
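
The difference between outcome-only and process-level scoring fits in a few lines. This is an illustration of the idea, not ProcessBench's actual protocol: each reasoning step is checked by some verifier (a placeholder here, in practice a trained process reward model or a human rater), and a solution only counts if the final answer is right and every step holds up.

```python
from dataclasses import dataclass

@dataclass
class Solution:
    steps: list[str]     # the model's reasoning chain, one step per entry
    final_answer: str

def step_is_valid(step: str, context: list[str]) -> bool:
    """Placeholder verifier: does this step follow from the steps before it?"""
    raise NotImplementedError("plug in a process reward model or human rater")

def outcome_score(sol: Solution, gold_answer: str) -> bool:
    """Outcome-only scoring: a lucky guess with broken reasoning still passes."""
    return sol.final_answer == gold_answer

def process_score(sol: Solution, gold_answer: str) -> bool:
    """Process-level scoring: the answer must be right AND every step must
    survive verification, filtering out right-for-the-wrong-reason wins."""
    steps_ok = all(step_is_valid(s, sol.steps[:i]) for i, s in enumerate(sol.steps))
    return steps_ok and sol.final_answer == gold_answer
```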

The Multimodal Gap

As models became natively multimodal in 2025, text-based evaluation became insufficient for real-world applications.

Companies deploying AI for document processing, financial analysis, and visual QA discovered that text benchmarks don't predict performance on actual workflows. Forms with complex layouts, charts requiring visual reasoning, diagrams mixing text and images. These dominate enterprise use cases, but standard benchmarks miss them entirely.

The academic community faced similar challenges. Video-MME, featured at CVPR 2025, exposed that models excel at describing static images but struggle with temporal consistency and causal reasoning over time. Performance drops significantly as video duration increases from seconds to minutes to hours, revealing that current models lack genuine temporal understanding.

The insight applies across modalities. You can't grade a vision model with a multiple-choice text quiz. As capabilities expand beyond text, evaluation must match the modality of deployment. This became a first-order problem for both academic benchmarks and production systems in 2025.

When Benchmarks Don't Match Reality

From Public Leaderboards to Custom Infrastructure

While academics built harder benchmarks, practitioners faced a different problem. Public leaderboards stopped correlating with production success.

Throughout 2025, leading AI organizations made a fundamental shift. They stopped relying on public benchmarks and started building custom, internal evaluation infrastructure. Platforms emerged to support this (LangSmith, Arize Phoenix, Weights & Biases Weave, and more), but the key insight transcended any specific tool.

Evaluation needed to become part of the development loop, not a separate research exercise.

The principles that mattered: custom evaluation sets built from production data and actual failure modes rather than synthetic academic benchmarks; CI/CD integration that automatically blocks deployments when quality regressions are detected; rigorous versioning that traces every performance change back to a specific prompt edit, model update, or data shift; and semantic intent measurement that asks whether the system achieved the user's underlying goal, not just whether it answered the literal question.
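
In practice, "CI/CD integration" usually reduces to a test that runs the system against a frozen, versioned eval set and fails the build below a threshold. A minimal pytest-style sketch, assuming a hypothetical run_system entry point and a JSONL file of cases harvested from production failures:

```python
import json
from pathlib import Path

PASS_RATE_THRESHOLD = 0.90  # block deployment if quality drops below this

def run_system(query: str) -> str:
    """Hypothetical entry point for the system under test (RAG pipeline,
    agent, prompt chain, ...); replace with your real call."""
    raise NotImplementedError

def grade(output: str, expected: dict) -> bool:
    """Task-specific grader built from real failure modes, e.g. checking that
    the quoted refund window matches policy. Keep it deterministic."""
    return all(s.lower() in output.lower() for s in expected["must_contain"])

def test_no_quality_regression():
    """CI gate: run the frozen eval set and fail the build on regression."""
    lines = Path("evals/production_cases.jsonl").read_text().splitlines()
    cases = [json.loads(line) for line in lines if line.strip()]
    passed = sum(grade(run_system(c["query"]), c) for c in cases)
    pass_rate = passed / len(cases)
    assert pass_rate >= PASS_RATE_THRESHOLD, (
        f"Eval pass rate {pass_rate:.2%} fell below {PASS_RATE_THRESHOLD:.0%}"
    )
```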

Safety and Operational Metrics

For high-stakes and regulated deployments, evaluation expanded beyond capability testing to include safety and behavioral criteria. Safety evaluation matured from ad hoc red-teaming into repeatable, quantitative measurement integrated directly into deployment pipelines.

For production deployments, traditional capability metrics became secondary to operational metrics. Time to first token matters more than MMLU score for customer-facing applications. Cost per request often determines build-versus-buy decisions. Throughput limits what's actually feasible for batch processing workloads. Reliability (the ability to consistently produce valid outputs or fail gracefully) beats benchmark performance every time.
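
Operational metrics are also the easiest of these to instrument. A rough sketch of capturing time to first token, throughput, and cost per request around a streaming call, where stream_completion is a placeholder for your provider's streaming client and the per-token prices are invented:

```python
import time

# Hypothetical pricing (USD per 1K tokens); substitute your provider's rates.
PRICE_PER_1K_INPUT = 0.0005
PRICE_PER_1K_OUTPUT = 0.0015

def stream_completion(prompt: str):
    """Placeholder for a streaming LLM client that yields tokens as they arrive."""
    raise NotImplementedError

def measure_request(prompt: str, input_tokens: int) -> dict:
    """Capture the operational numbers that actually drive deployment decisions."""
    start = time.monotonic()
    time_to_first_token = None
    output_tokens = 0
    for _ in stream_completion(prompt):
        if time_to_first_token is None:
            time_to_first_token = time.monotonic() - start
        output_tokens += 1
    total = time.monotonic() - start
    cost = (input_tokens * PRICE_PER_1K_INPUT +
            output_tokens * PRICE_PER_1K_OUTPUT) / 1000
    return {
        "time_to_first_token_s": time_to_first_token,
        "total_latency_s": total,
        "tokens_per_second": output_tokens / total if total else 0.0,
        "cost_usd": cost,
    }
```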

The insight became obvious once stated. The "best" model on MMLU is usually not the best model for your specific business problem. Custom evaluation against your data and your failure modes beats public benchmarks every single time.

What 2025 Taught Us

2025 taught us that we can't automate the definition of "truth" or "intelligence." Every breakthrough in evaluation (HLE, FrontierMath, SWE-bench Verified) represents a temporary win in an adversarial game. Models will eventually learn to game these benchmarks too. The treadmill will continue.

As someone building evaluation infrastructure at Goodeye Labs, where we're putting domain experts in control of AI quality without requiring them to code, I think the most important realization of 2025 wasn't that benchmarks saturated. It's that we were measuring the wrong things.

When we talk with customers deploying AI in production, they ask "which benchmark should we use?" The honest answer is probably none of the public ones. The benchmark you need probably doesn't exist yet. You have to build it from your production data, your actual failure modes, and your specific business requirements and constraints.

This is where theory meets practice. Companies shipping AI products in 2025 discovered that generic "helpfulness" scores don't tell you if your customer service bot is giving accurate refund policies. Passing HumanEval doesn't mean your code agent won't introduce security vulnerabilities in your codebase. A high MMLU score doesn't predict whether your RAG system will hallucinate on your specific document types.

In 2025, we learned that measuring AI is as hard as building AI. The smartest model needs the smartest test, and both are moving targets in an adversarial game where the models are learning to optimize for whatever metrics we create.

What to Build in 2026

The winning organizations in 2026 won't be those with the highest MMLU scores. They'll be the ones who build evaluation systems that reflect their reality, testing on their data, their edge cases, their actual user queries rather than academic proxies. They'll adopt adversarial evaluation mindsets, treating models as potential adversaries who will game whatever metrics you give them rather than cooperative students trying to learn. They'll go beyond vanilla LLM-as-a-judge, building sophisticated evaluation pipelines that combine automated scale with domain expertise, calibration mechanisms, and adversarial robustness. And they'll shift from "general intelligence" to "specific capability" verification, measuring what the system can actually do on real tasks rather than how "smart" it appears on synthetic tests.

The future of evaluation isn't building harder public benchmarks, though we need those too. It's building the expertise and infrastructure for every organization to create, maintain, and iterate on their own evaluation frameworks tailored to their specific needs.

The era of "trust the leaderboard" is over. The era of "trust your own measurements" has begun.

The only winning move is to measure what matters to you, with data that represents your world, and be prepared to update your measurements as quickly as the models evolve.

Interested in how we're solving these challenges?

At Goodeye Labs, we're building evaluation infrastructure that lets domain experts assess AI quality in minutes, with no coding required. Learn more about our approach and join teams already on our waitlist.

Visit Goodeye Labs and join the waitlist →

Randy Olson, PhD

Co-Founder & CTO at Goodeye Labs

Dr. Randy Olson builds evaluation infrastructure that puts domain experts in control of AI quality. He holds a PhD in Computer Science from Michigan State University and has spent over 15 years building production AI systems across research and industry.