Evals: The Product Manager's New Quality Gate

The old quality gate quietly broke

For twenty years, product quality had a clean definition. You wrote acceptance criteria, engineering built to them, QA checked them, and a feature either passed or it didn't. Same input, same output, every time. The gate was binary and the PM held the key.

AI features broke that gate without announcing it. An LLM-powered feature is probabilistic: the same prompt can return a great answer, a mediocre one, and a subtly wrong one on three consecutive runs. There is no single "correct output" to assert against. "It looked great in the demo" is not a quality bar — it's a survivorship-biased anecdote. And yet teams keep shipping on exactly that, because the muscle for measuring probabilistic quality was never built.

This is the gap evals fill. An eval is a repeatable test of quality: a set of representative inputs, a definition of what a good response looks like, and a score you can track release over release. It answers the two questions that matter before you ship an AI feature: is it good enough, and is it getting better or worse?

What an eval actually is

Strip away the tooling and an eval has three parts. First, a dataset of real, representative inputs — including the ugly edge cases and known failure modes, not just the happy path. Second, a definition of success: a rubric that says what "good" means for this task, this user, and this business. Third, a grader that scores each output against that rubric so you get a number you can watch over time.
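To make the three parts concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `EvalCase` shape, the "must mention these facts" rubric, the refund examples, and `fake_feature`, which stands in for the real LLM call so the sketch runs offline. A real rubric would be richer than keyword matching.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    case_id: str
    user_input: str
    must_mention: list[str]  # hypothetical rubric: facts a good answer must contain


def grade(output: str, case: EvalCase) -> float:
    """Grader: score one output from 0.0 to 1.0 as the share of required facts present."""
    if not case.must_mention:
        return 1.0
    text = output.lower()
    hits = sum(1 for fact in case.must_mention if fact.lower() in text)
    return hits / len(case.must_mention)


def run_eval(feature: Callable[[str], str], dataset: list[EvalCase]) -> float:
    """Run every case through the feature and return the mean score."""
    scores = [grade(feature(case.user_input), case) for case in dataset]
    return sum(scores) / len(scores)


# Hypothetical dataset: one happy path and one known edge case.
DATASET = [
    EvalCase("refund-basic", "How do I get a refund?", ["30 days", "receipt"]),
    EvalCase("refund-expired", "Can I return something after 45 days?",
             ["30 days", "store credit"]),
]


def fake_feature(user_input: str) -> str:
    """Stand-in for the real model call (assumption, not a real API)."""
    return "Refunds are accepted within 30 days with a receipt."


if __name__ == "__main__":
    print(f"eval score: {run_eval(fake_feature, DATASET):.2f}")
```

The canned answer covers the happy path fully but misses "store credit" on the edge case, so the overall score lands below 1.0. That is the point of including the ugly cases in the dataset.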

The grader comes in three flavours, and mature teams use all three:

1. Code-based checks: deterministic assertions (valid JSON, no banned phrases, a required field is present, the number matches the source). Cheap, fast, and where you should start.
2. LLM-as-judge: a second model scores outputs against your rubric at scale. Powerful for fuzzy qualities like tone, helpfulness, or faithfulness, but the judge itself must be validated against human labels before you trust it.
3. Human review: the gold standard for nuance and the calibration source for the other two. Expensive, so you spend it where it teaches you the most: building the labelled set your automated graders are measured against.
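The first flavour, and the validation step that the second one depends on, can be sketched in a few lines of Python. The banned-phrase list, the `answer` field name, and the sample labels are hypothetical. Simple agreement rate is only one possible way to compare a judge against human labels, and it is used here because it is the easiest to read.

```python
import json

# Hypothetical banned phrases; a real list would come from the product rubric.
BANNED_PHRASES = ["as an ai language model", "i cannot help"]


def code_check(output: str, required_field: str = "answer") -> bool:
    """Deterministic grader: valid JSON object, required field present, no banned phrases."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or required_field not in data:
        return False
    text = str(data[required_field]).lower()
    return not any(phrase in text for phrase in BANNED_PHRASES)


def agreement_rate(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Share of cases where the automated judge and the human reviewer gave the same verdict."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


if __name__ == "__main__":
    print(code_check('{"answer": "Refunds take 5 days."}'))  # True
    print(code_check("not json"))                            # False
    # Judge and human disagree on one of four cases.
    print(agreement_rate([True, True, False, True], [True, False, False, True]))  # 0.75
```

How much agreement is "enough" before you trust the judge is itself a product call: it depends on how costly a wrong verdict is for the criterion being graded.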

"If you can't measure 'good,' you can't ship it, improve it, or defend it. Evals are how a PM makes quality legible."

— TheGlocalPM

Why this is a PM job, not an eng job

It is tempting to file evals under "engineering" or "ML." That is a mistake. Engineering can build the harness that runs the tests — but the harness is worthless without a rubric, and the rubric is a product decision. What does a good answer look like for this user in this moment? Which failures are annoying versus which are unacceptable? How do you weigh a helpful-but-slightly-wrong answer against a safe-but-useless one? Those are judgments about user value and business risk — the exact judgments a PM exists to make.

In other words: the eval rubric is the acceptance criteria of the AI era, and the PM writes it. A team where engineers quietly invent the definition of "good" because the PM didn't show up has outsourced its most important product decision to whoever wrote the test — usually optimising for what's easy to measure instead of what actually matters to the user.

What mastering evals looks like

Fluency here is a concrete, learnable skill set. In practice it means being able to:

1. Build a golden dataset from reality: curate representative inputs from real usage, and deliberately seed it with the edge cases and past failures that actually hurt users.
2. Write a rubric that encodes value, not vibes: translate "a good answer" into criteria specific enough that two different reviewers would grade the same output the same way.
3. Pick, and validate, the right grader: match code, LLM-judge, or human review to each criterion, and check your automated graders against human labels before trusting their scores.
4. Make the eval a release gate: treat a score regression like a failing unit test that blocks the ship. A prompt tweak that lifts one case but drops five should never reach production unnoticed.
5. Close the loop: every production failure becomes a new eval case, so the same mistake can never ship twice. The eval set compounds into your most valuable product asset.
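The release-gate idea can be sketched as a per-case comparison between the scores of the current release and those of a candidate change. The function names, the case IDs, the scores, and the zero-tolerance default are all assumptions for illustration. Treating a case that is missing from the candidate run as a score of 0.0 is also a choice of this sketch, not a standard.

```python
def find_regressions(baseline: dict[str, float],
                     candidate: dict[str, float],
                     tolerance: float = 0.0) -> list[str]:
    """Return IDs of cases whose score dropped by more than `tolerance`.

    A case missing from `candidate` counts as 0.0 (an assumption of this sketch).
    """
    return sorted(
        case_id
        for case_id, old_score in baseline.items()
        if candidate.get(case_id, 0.0) < old_score - tolerance
    )


def release_gate(baseline: dict[str, float],
                 candidate: dict[str, float],
                 tolerance: float = 0.0) -> bool:
    """Pass only if no case regressed and the mean score did not drop."""
    regressions = find_regressions(baseline, candidate, tolerance)
    mean_old = sum(baseline.values()) / len(baseline)
    mean_new = sum(candidate.get(k, 0.0) for k in baseline) / len(baseline)
    return not regressions and mean_new >= mean_old


# Hypothetical scores: the prompt tweak lifts c1 but drops five other cases.
BASELINE = {"c1": 0.5, "c2": 1.0, "c3": 1.0, "c4": 1.0, "c5": 1.0, "c6": 1.0}
CANDIDATE = {"c1": 1.0, "c2": 0.9, "c3": 0.9, "c4": 0.9, "c5": 0.9, "c6": 0.9}

if __name__ == "__main__":
    print("regressed cases:", find_regressions(BASELINE, CANDIDATE))
    print("ship it?", release_gate(BASELINE, CANDIDATE))
```

In this made-up example the two runs have roughly the same average score, so an average alone would hide the problem. Comparing case by case is what surfaces the five drops and blocks the release.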

The trap: shipping on vibes

The most dangerous AI team isn't the one without a model — it's the one running on demo-driven development. Quality gets judged by a handful of cherry-picked prompts in a standup. There's no regression suite, so when a model update or a prompt change silently makes things worse, nobody notices until users do. Confidence is high and evidence is zero.

Evals are the antidote, and owning them is the modern definition of a PM owning quality. In deterministic software the PM could lean on QA and a green build. In probabilistic software, quality is a distribution you have to measure on purpose — and the PM who treats evals as optional is flying blind on the one thing that determines whether the product is trustworthy. Evals are the PRD of the AI era. Master them, and you own quality. Skip them, and you've handed the most important product decision you have to chance.

Ali • TheGlocalPM
