How are you measuring your AI agents?

by Fatos Bediu
We've been building AI support agents for a while now, and we kept hitting the same wall with standard RAG implementations: The Pizza Problem.

You slice a document (pizza) into arbitrary chunks and hope the retrieval system grabs the right slice. But often, it grabs half a mushroom and some unrelated crust. The real issue isn't just bad answers—it's that you can't measure the accuracy of a random slice. If the retrieval is "mostly" correct, how do you score that?
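To make the problem concrete, here's a minimal sketch of the naive fixed-size chunking the post is describing (the chunk size and sample text are illustrative, not from any real system). Notice how the boundaries fall mid-sentence, producing "half a mushroom" slices that are hard to score:

```python
# Naive fixed-size chunking: split text every `size` characters,
# with no regard for sentence or topic boundaries.
def chunk(text: str, size: int = 40) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

doc = "Refunds are processed within 14 days. Shipping to the EU takes 3-5 business days."
for c in chunk(doc):
    print(repr(c))  # boundaries land mid-sentence, mixing unrelated facts
```

A retriever scored against slices like these has no clean notion of "the right answer," which is exactly the measurement problem described above.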

We decided to completely tear down our architecture for Answerly v3. We stopped splitting documents into chunks and started converting them into discrete Q&A pairs via an LLM. Now we can actually calculate "Knowledge Coverage" and "Hit Rates," because each data unit is distinct.
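Once the knowledge base is a set of discrete Q&A pairs, retrieval becomes a classification-style problem you can score directly. Here's a hedged sketch of what "Hit Rate" and "Knowledge Coverage" could look like; the definitions, the keyword-overlap retriever, and all the data are assumptions for illustration, not Answerly's actual implementation:

```python
# Knowledge base: each unit is a distinct Q&A pair with a stable id.
qa_pairs = {
    "q1": "How long do refunds take?",
    "q2": "Do you ship to the EU?",
    "q3": "Can I change my order?",
}

# Evaluation set: user question -> id of the pair retrieval *should* return.
eval_set = {
    "How long do refunds usually take?": "q1",
    "Do you ship orders to the EU?": "q2",
}

def retrieve(question: str) -> str:
    # Stand-in retriever: real systems would use embeddings;
    # here we just pick the pair with the most shared words.
    def overlap(a: str, b: str) -> int:
        return len(set(a.lower().split()) & set(b.lower().split()))
    return max(qa_pairs, key=lambda k: overlap(question, qa_pairs[k]))

hits = sum(retrieve(q) == expected for q, expected in eval_set.items())
hit_rate = hits / len(eval_set)                         # fraction answered by the right pair
coverage = len(set(eval_set.values())) / len(qa_pairs)  # fraction of pairs the eval set exercises
print(f"hit rate: {hit_rate:.0%}, coverage: {coverage:.0%}")
```

Because every unit has a single correct id, "mostly correct" disappears: a retrieval is either a hit or a miss, and both metrics are exact counts.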

I wrote a technical breakdown of why we made this shift and how the "Pizza Analogy" drove our engineering decisions:

https://blog.buywhitelabel.com/customer-support-with-ai-in-2026-accuracy-is-all-that-matters/

I’m curious—for those of you building RAG systems, are you sticking with chunking, or have you found a better way to sanitize the input data for predictable accuracy?
