
Why AI Evals Are the New Unit Tests: The Quality Assurance Revolution in GenAI

Jun 11, 2025


How evaluation systems became the most critical skill for building reliable AI products

I remember the first time I deployed a machine learning model to production. It was 2019, and after weeks of training and validation, our model achieved 94% accuracy on the test set. We celebrated, pushed to production, and then watched in horror as user complaints started flooding in within hours.

The model was technically correct, but it was failing in ways our traditional testing never caught. Users were getting recommendations that were accurate but irrelevant. The system was working exactly as designed, but it wasn’t working for real people in real situations.

That experience taught me something that the AI industry is now learning at scale: traditional software testing breaks down completely when you’re dealing with large language models. And that’s why evaluation systems, or “evals,” have become the most important skill in AI development.

The Death of Deterministic Testing

Traditional software development operates in a predictable world. Write a function that adds two numbers, and you know exactly what to expect. Test it with different inputs, check the outputs, and you’re done. Same input always equals same output. Pass or fail. Simple.

But LLMs shattered this comfortable predictability.

Ask GPT-4 to write a product description twice, and you’ll get two different responses. Both might be good, but they’re different. How do you test that? How do you ensure quality when the very definition of “correct” becomes fluid?
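
To see the contrast in code: a traditional unit test pins one input to one output, while an LLM call has to be treated as a distribution of outputs. Here’s a minimal sketch; the generate_description stub is purely illustrative, standing in for whatever model client your stack actually uses.

```python
import random

# Traditional code: one input, one output, exact-match assertion.
def add(a: int, b: int) -> int:
    return a + b

assert add(2, 3) == 5  # same result on every run

# LLM-backed code: the same prompt yields a different string on each call.
# Stubbed with random.choice to mimic sampling; a real client call goes here.
def generate_description(prompt: str) -> str:
    return random.choice([
        "Lightweight hiking boots built for rocky trails.",
        "Durable, waterproof boots for serious hikers.",
    ])

# An exact-match assertion would fail intermittently, so evals score
# properties of the output (required facts, length, tone) instead.
```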

“Most overlooked skill in machine learning is creating evals,” Greg Brockman, President of OpenAI, recently observed. He’s not wrong. While everyone focuses on prompt engineering and model selection, the teams that win are the ones that figure out how to measure and maintain quality consistently.

This is why evals have become the new frontier. They’re not just nice-to-have testing additions. They’re the fundamental infrastructure that makes AI products reliable enough for real-world use.

Why Every AI Leader Is Talking About Evals

The shift in focus from model capabilities to evaluation systems isn’t coincidental. It reflects a maturing understanding of what actually matters when building AI products that people can depend on.

Garry Tan, CEO of Y Combinator, puts it bluntly: “Evals are emerging as the real moat for AI startups.” Not the latest model, not the cleverest prompts, but the ability to consistently measure and improve quality.

Logan Kilpatrick from Google AI Studio goes even further, calling better evals “one of the most important problems of our lifetime, and critical for continual progress.”

Why such strong language? Because without reliable evaluation systems, we’re essentially flying blind. We might build something that works well in demos but fails catastrophically with real users.

The Unique Challenges of LLM Testing

Traditional software testing assumes a world where systems behave predictably. LLMs operate in a fundamentally different reality:

Non-deterministic outputs: The same prompt can generate dozens of valid responses. Your eval system needs to assess quality across this spectrum of possibilities, not just check for exact matches.

Graded performance: Instead of binary pass/fail, LLM outputs exist on a spectrum from excellent to terrible, with a lot of nuanced middle ground. Your evaluation needs to capture these gradations.

Slow feedback loops: Running comprehensive evals can take minutes or hours, not milliseconds. This changes how you think about testing cycles and continuous integration.

Debugging complexity: When a traditional function fails, you get a stack trace pointing to line 47. When an LLM gives a bad response, you’re left wondering about training data, prompt context, and model reasoning processes you can’t directly observe.

These challenges mean that the testing approaches that work for conventional software simply don’t translate to AI systems.
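
What does translate is sampling and scoring: run the same prompt several times and grade each response against a rubric instead of asserting an exact string. Here’s a minimal sketch; the generate wrapper and the toy rubric are illustrative stand-ins, not any particular library’s API.

```python
from statistics import mean

def generate(prompt: str) -> str:
    # Stand-in for a real model call; swap in your client of choice.
    return "Waterproof hiking boots with a grippy sole for wet, rocky trails."

def score_response(response: str) -> float:
    """Toy rubric: graded 0.0-1.0 instead of binary pass/fail."""
    score = 0.0
    if "waterproof" in response.lower():   # required fact is mentioned
        score += 0.5
    if 40 <= len(response) <= 400:         # length within product-copy bounds
        score += 0.3
    if not response.isupper():             # tone check: not shouting
        score += 0.2
    return score

def eval_prompt(prompt: str, n_samples: int = 10, threshold: float = 0.7) -> bool:
    """Sample the same prompt n times and require a minimum average score."""
    scores = [score_response(generate(prompt)) for _ in range(n_samples)]
    return mean(scores) >= threshold

print(eval_prompt("Write a product description for our waterproof hiking boots"))
```

The design point is that the eval reports a graded, aggregate judgment over many samples, which is exactly what a binary pass/fail mindset misses.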

What LLMs Actually Excel At (And Where They Fall Short)

Understanding the specific strengths and weaknesses of large language models is crucial for building effective evaluation systems.

LLMs are genuinely impressive at generating fluent, coherent language. They can handle translation, summarization, and question-answering tasks that seemed impossible just a few years ago. They’re remarkably good at generalizing from examples and adapting their output style to match your requirements.

But they also have consistent failure modes that every AI product team needs to account for:

Consistency issues: Minor changes in prompts can lead to dramatically different outputs. A successful eval system needs to test for this variability and ensure acceptable performance across prompt variations.

Truth versus plausibility: LLMs are trained to generate text that sounds convincing, not necessarily text that’s factually accurate. They’ll confidently state incorrect information if it fits the pattern they’ve learned.

Reasoning limitations: While impressive at many tasks, LLMs still struggle with multi-step logical reasoning and mathematical calculations. They can often get the right answer through pattern matching, but fail when the problem requires genuine reasoning.

Knowledge boundaries: Training data cutoffs create blind spots. More importantly, LLMs don’t know what they don’t know, so they won’t reliably indicate uncertainty about information outside their training.

“The goal isn’t to eliminate these limitations, it’s to build systems that account for them systematically,” one AI engineering leader told me recently. That’s exactly what good eval systems do.
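
To make the consistency point concrete, here’s a minimal sketch that runs paraphrases of the same request through the model and flags when a minor wording change tanks quality. The generate and score helpers are illustrative stand-ins, not a specific framework’s API.

```python
# Consistency check: minor wording changes should not change answer quality much.

def generate(prompt: str) -> str:
    # Stand-in for a real model call.
    return "You can return unworn items within 30 days for a full refund."

def score(response: str) -> float:
    # Toy rubric: does the summary keep the facts a customer needs?
    required_facts = ["30 days", "refund"]
    return sum(fact in response for fact in required_facts) / len(required_facts)

PROMPT_VARIANTS = [
    "Summarize this return policy for a customer.",
    "Can you sum up the return policy for a customer?",
    "Give a customer-friendly summary of the return policy.",
]

def consistency_eval(variants, max_spread: float = 0.25):
    """Fail if the best and worst paraphrase scores drift too far apart."""
    scores = {v: score(generate(v)) for v in variants}
    spread = max(scores.values()) - min(scores.values())
    worst = min(scores, key=scores.get)
    return spread <= max_spread, worst, scores

passed, worst_variant, scores = consistency_eval(PROMPT_VARIANTS)
```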

Building Evaluation Systems That Actually Work

The most effective evaluation systems I’ve seen follow a structured approach that starts with real-world failure analysis rather than theoretical metrics.

The process begins with deep analysis of actual system performance. This means studying around 100 real user interactions to identify specific failure modes. Not generic problems like “sometimes gives wrong answers,” but specific patterns like “recommends expired products when users ask about current availability” or “switches to overly formal tone when users make casual requests.”

This analysis phase is crucial because it grounds your evaluation system in reality rather than assumptions about what might go wrong.
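
In practice, this analysis often amounts to tagging each reviewed interaction with the failure modes it exhibits and counting which ones dominate. A minimal sketch, with hypothetical tags and data:

```python
from collections import Counter

# Each reviewed interaction gets zero or more failure-mode tags (hypothetical labels).
reviewed_traces = [
    {"id": 1, "tags": ["recommends_expired_product"]},
    {"id": 2, "tags": []},
    {"id": 3, "tags": ["overly_formal_tone", "recommends_expired_product"]},
    # ...roughly 100 reviewed interactions in practice
]

failure_counts = Counter(tag for trace in reviewed_traces for tag in trace["tags"])

# The most frequent failure modes become the first automated evaluators you build.
for tag, count in failure_counts.most_common():
    print(f"{tag}: {count}")
```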

The next step involves building automated evaluators that can catch these specific failure modes at scale. The key insight here is to prefer code-based checks over LLM-as-judge approaches whenever possible. If you can write a deterministic test that checks whether a response contains required information, that’s often more reliable than asking another LLM to grade the response.
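
For example, the “recommends expired products” failure mode from the analysis step can be caught with a plain assertion over structured data, no judge model required. A minimal sketch, with a hypothetical catalog and response shape:

```python
# Code-based evaluator: deterministic, fast, and cheap relative to an LLM judge.
ACTIVE_SKUS = {"BOOT-201", "BOOT-305"}   # hypothetical product catalog
EXPIRED_SKUS = {"BOOT-101"}

def check_no_expired_products(response: dict) -> bool:
    """Fail if the model recommended anything that is no longer available."""
    recommended = set(response.get("recommended_skus", []))
    return recommended.isdisjoint(EXPIRED_SKUS) and recommended <= ACTIVE_SKUS

# Hypothetical model output captured from a logged interaction:
response = {"answer": "Try the BOOT-305 trail boot.", "recommended_skus": ["BOOT-305"]}
assert check_no_expired_products(response)
```

Checks like this run in milliseconds and never disagree with themselves, which is why they’re worth reaching for before an LLM judge.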

Finally, the system needs a continuous improvement loop. Fix the highest-impact failures first, then cycle back to analyze new failure modes that emerge as your system evolves.

The Strategic Importance of Getting This Right

What makes evals particularly important right now is that they’re becoming a competitive differentiator. Companies that master evaluation systems can iterate faster, ship more reliable products, and build user trust more effectively than those that don’t.

I’ve watched AI startups with impressive demos fail to gain traction because their products were inconsistent in real-world use. Meanwhile, companies with less flashy capabilities but robust evaluation systems steadily build user bases and expand their market presence.

The reason is simple: users don’t care how impressive your model is in perfect conditions. They care whether it works reliably when they need it to work.

Looking Forward

As AI capabilities continue to advance, the importance of evaluation systems will only grow. We’re moving from a world where having access to capable models was the constraint to a world where using them reliably is the challenge.

The teams that figure out evaluation systems first will have a significant advantage. Not just in building better products, but in building products that users can actually depend on.

This isn’t just a technical challenge; it’s a fundamental shift in how we think about software quality. We’re learning to measure and maintain reliability in systems that are inherently probabilistic and contextual.

The question isn’t whether your AI product needs robust evaluation systems. The question is whether you’ll build them before your competitors do, or after your users lose trust in your inconsistent outputs.

1. Hamel Husain & Shreya Shankar created the most practical course. Get it here (you get $800 off with this link).

2. Check out our deep dive in the newsletter.

How are you thinking about evaluation systems in your AI products? And more importantly, what failure modes are you not catching yet?


Written by Aakash Gupta

Helping PMs, product leaders, and product aspirants succeed