AI evals are dead. Long live AI evals.
The funny thing about building an AI product right now is that the hard part keeps changing.
A couple years ago, I was obsessed with offline evals. They felt clean. You write a test, you run it every time you change something, and you get a number you can trust. If the number goes up, you ship. If it goes down, you fix it. It’s the kind of engineering loop that makes you feel like you’re in control.
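That loop can be sketched in a few lines. To be clear, this is a hypothetical harness, not what we actually run: `run_eval`, the case format, and the toy cases are all illustrative.

```python
# A minimal sketch of an offline eval loop: fixed cases, one score.
# All names and cases here are illustrative, not a real harness.

def run_eval(model_fn, cases):
    """Run every case through the model and return the pass rate."""
    passed = sum(1 for case in cases if model_fn(case["prompt"]) == case["expected"])
    return passed / len(cases)

# In practice these would be real prompts paired with vetted answers.
cases = [
    {"prompt": "total signups last week", "expected": "SELECT count(*) FROM signups WHERE ..."},
    {"prompt": "revenue by month", "expected": "SELECT date_trunc('month', created_at) ..."},
]
```

The appeal is exactly what the number promises: change something, re-run, compare.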
And for a while, it worked.
Then, models got good.
Good enough that most of the evals you can write in a normal amount of time get saturated almost immediately. You spend half a week crafting a tricky eval, you finally feel proud of it, you run the new model, and it aces it. Great, right? Except now the eval isn’t measuring anything. It’s a speed bump the model already learned to drive over.
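One crude way to make "saturated" concrete: every recent model version maxes the eval out, so it no longer discriminates between versions. A hedged sketch — the ceiling and window are arbitrary choices, not a standard:

```python
def is_saturated(score_history, ceiling=0.98, window=3):
    """True if the last few model versions all score at or above the
    ceiling, i.e. the eval no longer separates versions. The 0.98
    threshold and window of 3 are illustrative defaults."""
    recent = score_history[-window:]
    return len(recent) == window and all(s >= ceiling for s in recent)
```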
Take our AI data analyst at @Basedash.
There was a time when the main failure mode was technical. Bad joins, wrong filters, hallucinated columns, slightly-off date logic that nobody notices. If you’re building anything that touches business data, those mistakes are brutal because they destroy trust fast. So we treated technical correctness like the whole game. If we could get text-to-SQL right, everything else would follow.
Now, barely a year later, text-to-SQL is a solved problem.
Our AI analyst almost never makes technical mistakes anymore. It generates clean SQL. It respects schemas. It handles edge cases that used to be a minefield.
And yet, the product still isn’t “done.” Not even close.
Because the stuff that’s left doesn’t show up in the old evals.
The real challenges are squishier now. They’re about taste. Context. Judgment. Whether the AI is actually aligned with what you care about, not just what you asked for.
Here are three big areas that keep coming up for us.
1. Does the AI know what we care about as a business?
Two companies can look at the same dataset and want totally different answers. One team cares about retention by cohort. Another cares about gross margin by product line. Another cares about sales cycle length by segment. The SQL queries can be perfect and still miss the point because they're optimizing for a generic idea of “analysis,” not your idea of what actually matters.
This is the kind of problem that’s hard to even describe without sounding vague. But you feel it instantly when you use the tool. You can tell when the AI is pulling on the thread you would pull on, versus when it’s producing something that’s technically correct but functionally irrelevant.
2. Does it share insights I hadn’t considered before?
This is where things get uncomfortable, because it’s not just “did it answer the question.” It’s “did it help me think.”
A lot of AI products today feel like competent interns. They do exactly what you asked, quickly and correctly, and then they stop. That’s useful, but it’s not what makes you lean back and go, oh, that’s interesting.
The best moments are when it notices something you didn’t prompt it for. A weird spike that lines up with a code change. A segment behaving differently than the rest. A metric that looks healthy until you drill down one level deeper.
But you can’t evaluate that with a neat answer key. There is no single “correct” surprise.
3. Does it present visualizations in ways I find compelling?
How you present an insight changes whether a human believes it, remembers it, and acts on it.
A chart can be technically accurate and still feel wrong. Wrong scale, wrong framing, wrong comparison, wrong default time window, wrong granularity. Or it can be correct but boring in a way that makes you move on too quickly. Good visualization is opinionated. It’s a tiny act of storytelling. And storytelling is hard to score.
So, where does that leave evals?
For me, it means the job shifted from “can the model do the task” to “does the model do the task the way we want.” That difference is subtle, but it changes everything about how you measure progress.
It pushes you toward evals that are:
grounded in real usage, not synthetic prompts
tied to whether users trust and adopt outputs
sensitive to preference, not just correctness
Sometimes that looks like human judgment. Sometimes it looks like thumbs-up signals from real users. Sometimes it looks like comparing two outputs side-by-side and asking which one you’d actually ship in the product.
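That side-by-side comparison reduces to a simple win rate. A hypothetical sketch — the judge (a human, or a model) produces the verdicts and is outside the scope of the function:

```python
def win_rate(judgments):
    """judgments: a list of 'a', 'b', or 'tie' verdicts from side-by-side
    comparisons. Returns candidate A's win rate over the decided pairs."""
    decided = [j for j in judgments if j in ("a", "b")]
    if not decided:
        return 0.5  # no decided pairs, no signal either way
    return decided.count("a") / len(decided)
```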
It also means you spend more time on context. Not just “what’s the schema,” but “what does this company care about,” “what does this person usually look at,” “what decisions are they trying to make.” The model can be brilliant and still fail if it doesn’t have that.
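What that context might look like, shape-wise: a payload assembled before the question ever reaches the model. Every field name below is an assumption about what such a payload could carry, not our actual schema.

```python
# Hypothetical context payload; all field names are illustrative.
context = {
    "schema": {"orders": ["id", "customer_id", "total", "created_at"]},
    "business_priorities": ["retention by cohort", "gross margin by product line"],
    "user_recent_views": ["weekly_active_users", "churn_by_plan"],
    "decision": "should we change pricing for the enterprise tier?",
}

def build_prompt(question, context):
    """Prepend the assembled context to the user's question."""
    return f"Context: {context}\n\nQuestion: {question}"
```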
In a way, I find this shift encouraging.
When the main problems were technical, the work was mostly about getting the model to stop being wrong. Now the work is about getting it to be interesting. Helpful. A little bit opinionated in the right way. More like a great analyst you’d actually want on your team, not just a friendly SQL compiler.
If your evals are saturating, that’s not a sign you’re done. It’s a sign your measuring stick is stuck in the old world. The next iteration isn’t harder tests. It’s taste, context, and insight. That’s where the real product work remains.