The target was 95 percent. After the first iteration, we were at 40. The gap stopped me, but only for a second. What stopped me longer was how we knew the number. It came from user feedback. That was the whole measurement.
Forty percent is not a failure. It is a starting point. But it is one I cannot do anything with. If I ask my team what to fix next, they will give me hypotheses. If I ask them whether the next iteration will move the number, they will tell me to wait for the next round of user feedback. And the round after that. And the round after that.
User feedback is the most honest signal we have. People either get the answer they came for or they don't. They either come back or they go ask a colleague. No metric is more grounded in reality. But it is the slowest signal there is, and the least diagnostic. It tells you whether the system worked, not why it didn't. I know what happens next, because I have watched it happen. Someone in the room says we just need more feedback rounds. Someone else says we should hire a survey company. Both are right, and neither helps me know what to fix on Monday.
So you reach for benchmarks. You build a test set. You run it after every change and watch the numbers move. That is how every other software delivery I have run has worked, and it is how this one will have to work if we want to ship a usable assistant before next year.
Then comes the catch. The benchmarks themselves can lie to you. I have read enough lately to know that the evaluation datasets many teams use are generated by the same kind of model that powers the chatbot. The questions look like the documents, because they were written from the documents. Retrieval looks fantastic, because the system is being asked questions it was almost designed to answer. The result is a quality score that goes up while real users keep clicking thumbs down.
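Here is a rough sketch of how that leak could be made visible. It is a toy under loud assumptions, not a validated method: the passage is my own paraphrase rather than official rule text, both questions are invented for illustration, and the `overlap` helper is a hypothetical name for a deliberately crude word-overlap measure. It only asks how much of a question's vocabulary already sits in the passage it is supposed to retrieve.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase word tokens; a deliberately crude tokenizer."""
    return set(re.findall(r"[a-z]+", text.lower()))


def overlap(question: str, passage: str) -> float:
    """Fraction of the question's distinct words that also appear in the passage."""
    q = tokens(question)
    if not q:
        return 0.0
    return len(q & tokens(passage)) / len(q)


# Toy passage: an illustrative paraphrase, NOT official rule text.
passage = (
    "A ball is lost if it is not found within three minutes "
    "after the player begins to search for it."
)

# Hypothetical questions: one written from the passage,
# one phrased the way a golfer might actually ask it.
synthetic_q = "When is a ball lost if it is not found within three minutes?"
golfer_q = "I hit it into the trees and can't see it anywhere, how long do I get to look?"

synthetic_score = overlap(synthetic_q, passage)
golfer_score = overlap(golfer_q, passage)

print(f"synthetic question overlap: {synthetic_score:.2f}")
print(f"golfer question overlap:    {golfer_score:.2f}")
```

The question written from the passage shares nearly all of its words with it. The golfer's version shares almost none. A test set made only of the first kind would flatter any retriever that matches on wording, which is exactly the score that climbs while the thumbs stay down.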
I have a phrase I have used before about processes and data: crap in plus AI equals hypercrap. I wrote about it in an earlier post, "Crap In + AI = Hypercrap." Bad inputs do not get diluted by AI; they get scaled by it. The same is true for evaluation. Bad benchmarks plus AI give you confident, repeatable nonsense, delivered in a format that looks like progress.
This is not the first time I have been in this position. In every wave I have lived through (BI, RPA, PowerApps), the same pattern has shown up: a layer of metrics that looks reliable until you open it. You find that the BI dashboard everyone trusts has two columns computed from each other. You realize the RPA bot reporting 98 percent task completion is counting half-finished work as done. You notice that the PowerApp the team built has lost most of its users by week three.
And there is always a leader, often me, who has to decide whether to trust the report and move on, or to spend a weekend understanding what is actually being measured. The leaders who got ahead in each of those waves were not the ones who read the most reports. They were the ones who, at some point, opened the box themselves.
So I am building one.
Not a production system. Nothing connected to anything that matters. A small RAG knowledge assistant on the rules of golf. I picked golf for two reasons. The first is that the rules are written in formal, precise language, and the questions a real golfer asks are written in everything except formal, precise language. That gap is the same gap any enterprise RAG faces between policy documents and the people who actually have questions about them. The second reason is more useful. I know golf. I also love it, which helps. If the system gives me a wrong answer, I will spot it in two seconds. I will not need a subject matter expert to validate the output. I will be the subject matter expert.
I am not doing this to become an AI engineer. I am doing this to develop a detector. I want to know, from the inside, what a benchmark number is actually measuring. I want to know what is easy and what is hard, what looks reliable and what is quietly broken. I want to be able to sit in a steering committee in six months and ask the one question that exposes a weak claim, because I have made and broken that claim myself.
There is a practical reason this matters now. The next conversation in most organizations is not about a single chatbot. It is about agents. About chains of AI tools doing real work, with real consequences, on top of the same retrieval foundations we are building today. The credibility I will need to fund that next step is being earned right now, on whether I can deliver this simpler thing first. If I ship a RAG that gets adopted, I get to have the agentic conversation. If I ship one that gets quietly ignored, I do not.
There is also a quieter reason, which is the one I keep coming back to. I have always believed that mentoring forces experts to verify their own knowledge. The act of teaching someone makes you check what you actually know, because you can hear yourself say it out loud. Building does the same thing in reverse. It forces leaders to verify what they have been nodding along to. The number you accepted in a slide. The architecture you approved in a diagram. The metric you cited in your last steering committee. You either know what is inside them or you don't, and the only way to find out is to put your hands on something small and see what happens.
I am going to write the next three or four posts as I work through this. Some of it will be technical, written for people in a similar seat. Some of it will be about what surprised me, including the parts where my assumptions break. I would rather find out now, on golf rules, than in two years, on something that matters.
When was the last time you sat with a number you couldn't interrogate?
Time to tee up.
Wojciech Pozarzycki, May 2026