AI Craft
July 2026 · 7 min read

Evals: The Product Manager's Quality Gate

In an AI product, you don't ship features you can fully predict; you ship behaviours. Evals are how a PM defines "good," measures it, and holds the line. Writing them is becoming the single most important craft for a PM to master.

EVALS: Define "good." Measure it. Hold the line.

The old quality gate quietly broke

For twenty years, product quality had a clean definition. You wrote acceptance criteria, engineering built to them, QA checked them, and a feature either passed or it didn't. Same input, same output, every time. The gate was binary and the PM held the key.

AI features broke that gate without announcing it. An LLM-powered feature is probabilistic: the same prompt can return a great answer, a mediocre one, and a subtly wrong one on three consecutive runs. There is no single "correct output" to assert against. "It looked great in the demo" is not a quality bar — it's a survivorship-biased anecdote. And yet teams keep shipping on exactly that, because the muscle for measuring probabilistic quality was never built.

This is the gap evals fill. An eval is a repeatable test of quality: a set of representative inputs, a definition of what a good response looks like, and a score you can track release over release. It is the answer to the only two questions that matter before you ship an AI feature — is it good enough? and is it getting better or worse?

What an eval actually is

Strip away the tooling and an eval has three parts. First, a dataset of real, representative inputs — including the ugly edge cases and known failure modes, not just the happy path. Second, a definition of success: a rubric that says what "good" means for this task, this user, and this business. Third, a grader that scores each output against that rubric so you get a number you can watch over time.
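Those three parts fit in a few dozen lines. What follows is a minimal sketch rather than a real harness: `Case`, `grade`, `run_eval`, and `toy_model` are hypothetical names invented for this example, the dataset is made up, and the grader is a deliberately crude phrase check standing in for whatever rubric your product actually needs.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Case:
    """One dataset entry: an input plus the rubric for judging its output."""
    prompt: str
    must_include: list[str] = field(default_factory=list)      # phrases a good answer contains
    must_not_include: list[str] = field(default_factory=list)  # phrases that make it a failure


def grade(output: str, case: Case) -> bool:
    """Grader: True if the output meets the rubric for this case."""
    text = output.lower()
    has_required = all(p.lower() in text for p in case.must_include)
    has_forbidden = any(p.lower() in text for p in case.must_not_include)
    return has_required and not has_forbidden


def run_eval(model: Callable[[str], str], dataset: list[Case]) -> float:
    """Run every case through the model and return the share that passed."""
    results = [grade(model(case.prompt), case) for case in dataset]
    return sum(results) / len(results)


# Illustrative dataset: two happy-path inputs and one ugly edge case.
DATASET = [
    Case("How do I reset my password?", ["reset link"], ["call support"]),
    Case("Can I export my data?", ["csv"]),
    Case("asdf ??", ["rephrase"]),  # garbage input: a good answer asks the user to rephrase
]


def toy_model(prompt: str) -> str:
    """Stand-in for the real feature; swap in your own model call."""
    canned = {
        "How do I reset my password?": "We'll email you a reset link.",
        "Can I export my data?": "Yes, export as CSV from Settings.",
    }
    return canned.get(prompt, "I'm not sure what you mean.")


if __name__ == "__main__":
    print(f"pass rate: {run_eval(toy_model, DATASET):.0%}")
```

The toy model handles both happy-path cases and fails the edge case, so the number it reports is two out of three. That single figure is the thing you watch release over release.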

The grader comes in three flavours, and mature teams use all three:

"If you can't measure 'good,' you can't ship it, improve it, or defend it. Evals are how a PM makes quality legible."
— TheGlocalPM

Why this is a PM job, not an eng job

It is tempting to file evals under "engineering" or "ML." That is a mistake. Engineering can build the harness that runs the tests — but the harness is worthless without a rubric, and the rubric is a product decision. What does a good answer look like for this user in this moment? Which failures are annoying versus which are unacceptable? How do you weigh a helpful-but-slightly-wrong answer against a safe-but-useless one? Those are judgments about user value and business risk — the exact judgments a PM exists to make.

In other words: the eval rubric is the acceptance criteria of the AI era, and the PM writes it. A team where engineers quietly invent the definition of "good" because the PM didn't show up has outsourced its most important product decision to whoever wrote the test, and that person will usually optimise for what's easy to measure instead of what actually matters to the user.
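The judgments described above (annoying versus unacceptable, helpful-but-slightly-wrong versus safe-but-useless) can be written down as a weighted rubric with a veto. This is one possible sketch, not a standard: the criterion names, the weights, and the `unacceptable` flag are all assumptions made for illustration, and choosing them is exactly the product decision in question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float               # relative importance: a product judgment, not an engineering one
    unacceptable: bool = False  # if True, anything short of a full pass zeroes the score


# Hypothetical rubric; the weights encode how the PM trades criteria off.
RUBRIC = [
    Criterion("factually_correct", weight=0.5),
    Criterion("actually_helpful", weight=0.3),
    Criterion("right_tone", weight=0.2),
    Criterion("no_unsafe_advice", weight=0.0, unacceptable=True),
]


def score(ratings: dict[str, float], rubric: list[Criterion] = RUBRIC) -> float:
    """Combine per-criterion ratings (each in [0, 1]) into one score in [0, 1]."""
    # Unacceptable failures are vetoes, not deductions.
    for criterion in rubric:
        if criterion.unacceptable and ratings[criterion.name] < 1.0:
            return 0.0
    total_weight = sum(c.weight for c in rubric)
    return sum(c.weight * ratings[c.name] for c in rubric) / total_weight


if __name__ == "__main__":
    helpful_but_slightly_wrong = {
        "factually_correct": 0.5, "actually_helpful": 1.0,
        "right_tone": 1.0, "no_unsafe_advice": 1.0,
    }
    safe_but_useless = {
        "factually_correct": 1.0, "actually_helpful": 0.0,
        "right_tone": 1.0, "no_unsafe_advice": 1.0,
    }
    print(f"helpful but slightly wrong: {score(helpful_but_slightly_wrong):.2f}")
    print(f"safe but useless:           {score(safe_but_useless):.2f}")
```

With these assumed weights the slightly wrong answer scores 0.75 and the useless one 0.70. Move the weight on correctness from 0.5 to 0.7 and the ranking flips, which is the point: the numbers are where the product opinion lives.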

What mastering evals looks like

Fluency here is a concrete, learnable skill set. In practice it means being able to:

The trap: shipping on vibes

The most dangerous AI team isn't the one without a model — it's the one running on demo-driven development. Quality gets judged by a handful of cherry-picked prompts in a standup. There's no regression suite, so when a model update or a prompt change silently makes things worse, nobody notices until users do. Confidence is high and evidence is zero.
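The missing regression suite does not need to be elaborate. As a rough sketch, assuming you already have per-suite eval scores for the current release and for a candidate, a check like the one below flags any suite that got meaningfully worse. The function name, the suite names, the scores, and the 0.02 tolerance are all illustrative assumptions, not measurements from a real product.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return {suite: drop} for every suite whose score fell by more than `tolerance`."""
    regressions = {}
    for suite, old_score in baseline.items():
        # A suite missing from the candidate run is treated as a score of 0,
        # so forgetting to run it shows up as a regression rather than silence.
        new_score = candidate.get(suite, 0.0)
        drop = old_score - new_score
        if drop > tolerance:
            regressions[suite] = round(drop, 4)
    return regressions


if __name__ == "__main__":
    # Made-up scores: the candidate looks better on the happy path
    # and quietly worse on edge cases.
    baseline = {"happy_path": 0.94, "edge_cases": 0.81, "known_failures": 0.60}
    candidate = {"happy_path": 0.95, "edge_cases": 0.72, "known_failures": 0.59}

    regressions = find_regressions(baseline, candidate)
    if regressions:
        print("Do not ship yet:", regressions)
    else:
        print("No regressions beyond tolerance.")
```

In this made-up run the happy path improved, which is what a demo would show, while the edge-case suite dropped by nine points. Only the second number would have been invisible in a standup.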

Evals are the antidote, and owning them is the modern definition of a PM owning quality. In deterministic software the PM could lean on QA and a green build. In probabilistic software, quality is a distribution you have to measure on purpose — and the PM who treats evals as optional is flying blind on the one thing that determines whether the product is trustworthy. Evals are the PRD of the AI era. Master them, and you own quality. Skip them, and you've handed the most important product decision you have to chance.

Ali — TheGlocalPM


Senior Product Leader exploring the intersection of human intuition and artificial intelligence. Built with chaos, delivered with logic.
