Nobody fact-checks the answer they get at the coffee machine from their work pals. You ask, you hear, you act. That's the bar your RAG chatbot needs to clear.

Think about it. When you have a quick question at work, you catch someone who knows the answer and ask. Thirty seconds later, you're done. You don't cross-reference it with documentation. You trust them because you've worked with them long enough to know they're reliable on this topic.

Now think about what happens when you ask the same question to a new colleague. Someone who just joined the team. You listen politely, but then you quietly verify the answer. You check with someone else, or you look it up yourself. Not because the new person is wrong. Because you haven't built trust yet.

This is the test your RAG chatbot needs to pass. Not an accuracy benchmark. Not a faithfulness score. The test is: will people stop checking its answers?

The metrics the industry gives you

The RAG evaluation space isn't empty. Far from it. Frameworks like RAGAS measure whether your chatbot's answers are accurate, relevant, and grounded in the right sources. Tools like DeepEval and LangSmith help you trace what went wrong when they aren't. The LLM-as-a-judge approach lets you scale evaluation without human reviewers for every query.
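To make the idea concrete, here is a toy stand-in for the kind of groundedness score these frameworks compute. Real frameworks like RAGAS use an LLM judge over individual claims; this sketch just measures naive token overlap between answer and retrieved context, and the function name and example strings are illustrative only.

```python
# Toy groundedness score: the fraction of answer tokens that also
# appear in the retrieved context. Crude compared to an LLM judge,
# but it shows the shape of the metric.

def groundedness(answer: str, context: str) -> float:
    """Share of answer tokens present in the context (0.0 to 1.0)."""
    answer_tokens = answer.lower().split()
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    hits = sum(1 for t in answer_tokens if t in context_tokens)
    return hits / len(answer_tokens)

context = "business trips abroad must be requested in the travel portal"
grounded = groundedness("request the trip in the travel portal", context)
ungrounded = groundedness("email your manager a scanned form", context)
print(round(grounded, 2), round(ungrounded, 2))
```

An answer drawn from the context scores high; an answer the retriever never saw scores near zero. Production metrics work at the claim level rather than the token level, but the principle is the same.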

These tools are real and they work. If you're building a RAG chatbot, you should use them.

But here's the problem: most organizations don't. One analysis from late 2025 estimates that 70% of RAG systems still lack systematic evaluation frameworks. Not because the tools aren't available. Because the real cost of evaluation sits somewhere these tools can't reach.

What the metrics don't cover

When I got involved in the delivery of our first RAG prototype, I assumed testing would be the easy part. Build the thing, run some queries, check the answers. A few days of work, maybe a week.

I was wrong. And I'm still in the middle of it, learning just how wrong.

The first challenge, before any metric even applies, is your knowledge base. If the documents you feed into the system are outdated, inconsistent, or poorly structured, no evaluation framework will save you. This is the old "garbage in, garbage out" principle, except with RAG it runs at scale and with confidence. A chatbot that confidently delivers wrong answers from bad documentation is worse than no chatbot at all.

Then comes the testing itself. You need realistic questions, not the ones you made up at your desk, but the ones people actually ask. That means involving the team. And team members have day jobs. Every hour they spend testing a chatbot is an hour not spent on their actual work. For a team that hasn't seen any benefits yet, this is a hard sell.

And then, even after deployment, there's the verification tax. Every time a user gets an answer and then spends five minutes checking whether it's correct, you're not saving time. You're adding a step. The chatbot was supposed to replace searching the documentation, but now the user is searching the documentation to verify the chatbot. You've made the process longer, not shorter.
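The verification tax is easy to put numbers on. A back-of-the-envelope sketch, where every figure is an assumption for illustration:

```python
# Net time saved per question, with made-up numbers. If users verify
# a share of answers, the chatbot's savings can easily go negative.

def net_seconds_saved(baseline_s: float, chatbot_s: float,
                      verify_s: float, verify_rate: float) -> float:
    """Baseline (asking a colleague) minus chatbot time, minus the
    expected cost of verification."""
    return baseline_s - (chatbot_s + verify_rate * verify_s)

# Coffee-machine baseline: ~30 s. Chatbot answer: ~20 s.
# Verification, when it happens: ~5 minutes.
print(net_seconds_saved(30, 20, 300, 1.0))  # everyone verifies: -290.0
print(net_seconds_saved(30, 20, 300, 0.0))  # nobody verifies: 10.0
```

With full verification, every question costs nearly five minutes more than the coffee machine. The savings only appear as the verification rate drops, which is exactly the trust curve this article is about.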

So what should you actually measure?

The only benchmark that matters

I keep coming back to the coffee machine test. When someone grabs a colleague in the kitchen (or on Slack) and asks, "Hey, how do you request a business trip abroad?", they walk away and act on the answer. No verification. No second opinion. The whole interaction takes under a minute and generates immediate value.

Your chatbot competes with that. Not with a search engine. Not with reading documentation. With the easiest, fastest, most trusted way people already get answers: asking someone they trust.

This means the real evaluation metric isn't accuracy at 95% or faithfulness at 0.92. It's adoption. Whether people come back after the first week. The moment a user gets an answer and acts on it without opening the source document to double-check.

No framework measures this. But it's the only metric your stakeholders actually care about.

A pattern I've seen before

This isn't new. I've watched the same dynamic play out with BI dashboards, more than once.

The data was correct. The visualizations were clean. The reports answered real business questions. And still, when someone wanted to challenge a decision, the first move was: "I don't trust those numbers." Not because the numbers were wrong. Because questioning the tool was easier than engaging with what it said.

AI chatbots will face the same resistance. If someone in the organization doesn't want to change a process, or feels threatened by the tool, or simply prefers asking their colleague, they'll find a reason to dismiss the bot's answers. "It hallucinated once last month" becomes the excuse that kills adoption for a quarter.

A chatbot with 95% accuracy that nobody uses is just expensive documentation.

A chatbot people trust enough to stop verifying is a team member.

What this means if you're planning a RAG deployment

If you're a manager planning a RAG chatbot project, here are three budgeting considerations that are most often overlooked:

1. Knowledge base hygiene

Budget significant time for cleaning and organizing your existing documentation. A chatbot doesn't fix a messy knowledge base; it makes the flaws visible faster and at scale. Address the mess before deployment.

2. Real-world trust assessment

Allocate weeks, not days, for user testing. Move beyond synthetic test sets and prioritize real questions from actual users. The goal is an honest answer to one question: are the responses trustworthy enough for people to act on without verifying them?

3. Behavior and adoption

Measure usage and user behavior, not just accuracy scores. Track retention: do people come back after the first week? The real success signal is the moment a user acts on an answer without double-checking it. That is what trust and adoption look like.
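Retention is cheap to compute once you log queries. A minimal sketch, assuming a log of (user, week) pairs; all names and data are illustrative:

```python
# Week-over-week retention from a chatbot query log.
# Each log row: (user_id, week_number).

from collections import defaultdict

def weekly_retention(log: list[tuple[str, int]]) -> dict[int, float]:
    """For each week after the first, the share of the previous
    week's users who came back."""
    users_by_week: dict[int, set[str]] = defaultdict(set)
    for user, week in log:
        users_by_week[week].add(user)
    weeks = sorted(users_by_week)
    retention = {}
    for prev, cur in zip(weeks, weeks[1:]):
        returned = users_by_week[prev] & users_by_week[cur]
        retention[cur] = len(returned) / len(users_by_week[prev])
    return retention

log = [("ann", 1), ("bob", 1), ("cat", 1),
       ("ann", 2), ("bob", 2),
       ("ann", 3)]
print(weekly_retention(log))  # roughly {2: 0.67, 3: 0.5}
```

A retention curve that flattens above zero means people are coming back; one that decays to zero means the bot has become expensive documentation, whatever its accuracy score says.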

And be honest about the cost curve. For the first weeks, maybe months, the chatbot will cost more time than it saves. Your team is testing, providing feedback, verifying answers, and building trust. That investment is real. If you don't plan for it, the project will look like a failure right when it's actually working as expected.

The trust threshold

There's a moment in every technology adoption when the tool stops being "that new thing" and becomes "how we do things here." With BI, it took years. With RPA, some organizations never got there.

With RAG chatbots, the moment arrives when a user asks a question, reads the answer, and goes about their day. No checking. No second-guessing. They trust it the way they trust the colleague at the coffee machine.

So before you start measuring faithfulness scores, ask yourself a simpler question: when your team has a quick question, is it easier to ask the chatbot or to walk to the coffee machine? If the answer is still the coffee machine, you know what you need to work on. And it isn't the model.

Wojciech Pozarzycki, February 2026
