In this post, I am sharing my initial insights from my side project to build a RAG-based knowledge assistant. You can read more about what I am doing here: Why I am building the RAG by myself?

I was a few hours into matching the questions from the golden set of truth to the right answers when Rule 5.6 of the Golf Rules stopped me. Ironically, the rule on slow play has also slowed my side project to build a RAG-based golf rules AI assistant. 

LangChain’s RecursiveCharacterTextSplitter, which I used to prepare documents for a RAG system, cut the rule in half, doing exactly what I'd told it to: chop the text into chunks of about 500 tokens. The penalty for the breach, the full ladder from one stroke to disqualification, ended up in one chunk. The conditions, the allowed reasons to pause, and the heading with the rule's name on it ended up in the chunk before. A 50-token overlap joined the two with a single shared line. That was the whole bridge between the two halves of one rule.

So I sat there with a question I hadn't expected on a Saturday afternoon. When I mark the correct answer to "what happens if I play too slowly," which chunk do I choose? The one that names the rule but stops before the real penalty, or the one that holds the penalty but never says what it's for? And if my retrieval system grabs the chunk with the heading and feels like it found the rule, has it found the answer, or just the title?

If you read the last post, you know what I'm doing here. I'm building a small RAG assistant for the rules of golf, not to become an AI engineer, but to build a detector: the instinct to look at a confident quality number and know what lies beneath it. I expected that instinct to come from building the system. It came from somewhere I wasn't looking.

Before I get to where, a minute on how the thing works, because the rest depends on it.

How a RAG actually works

RAG stands for retrieval-augmented generation, and the idea is simpler than the name. A language model, on its own, answers based on what it absorbed in training. It doesn't know your documents, and if you ask about them anyway, it will often produce something fluent and wrong. RAG fixes that by handing the model the relevant parts of your documents at the moment you ask.

It runs in two moves. First, it retrieves: when a question comes in, the system searches a knowledge base built from your documents and pulls back a handful of pieces that most closely match the question. Then it generates: it gives those pieces to the language model along with your question, and the model writes an answer based on what it was given.

The model doesn't know your rules. It looks them up first, then answers based on what it found.
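The two moves can be sketched in a few lines. This is a toy illustration, not the pipeline from this project: the `embed` function is a bag-of-words stand-in for a real embedding model, the chunk texts are invented placeholders rather than actual rule wording, and the final call to a language model is left out, so the sketch stops at the prompt the model would receive.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Move 1: return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, retrieved: list[str]) -> str:
    """Move 2 (first half): hand the retrieved pieces to the model with the question."""
    context = "\n---\n".join(retrieved)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Invented placeholder chunks, not real rule text.
chunks = [
    "A player may lift and clean the ball on the putting green.",
    "Unreasonable delay of play carries a penalty.",
    "A round consists of 18 holes played in order.",
]
question = "What is the penalty for delay of play?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

Whatever `retrieve` returns is all the model gets to see. If the right chunk is not in `top`, nothing downstream can put it back.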

One thing in that picture matters more than the rest. The answer is only ever as good as the pieces retrieval hands over. If the right piece isn't in that handful, the model fills the gap confidently. So the quality of a RAG system rests, to a large degree, on the quality of retrieval. And the quality of retrieval rests on something even less glamorous: how the documents were cut up in the first place.

Chunking is where the trouble starts

A knowledge base can't be searched as one long document. It has to be broken into smaller pieces because retrieval works by finding the pieces closest in meaning to your question. So the system slices each document into chunks of roughly equal size, a few hundred words each. It does this by length, not by sense. It counts to about 500 tokens and cuts, wherever that lands.

Documents are sliced into fixed-size pieces by length, not by meaning. A thin overlap repeats a line or two between neighbours.

To soften the cuts, you add a small overlap so each chunk repeats the last line or two of the previous one. The intent is to avoid slicing a thought clean down the middle. In practice, the overlap is thin, and it rescues far less than you'd hope. Rule 5.6 is where I first felt that, but I'm getting ahead of myself.
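Here is a minimal sketch of cutting by length with an overlap. It is deliberately simpler than LangChain's splitter, which as far as I understand also tries to break at separators such as paragraph breaks before falling back to raw length. The function name `chunk_by_length`, the use of whitespace-separated words as a stand-in for tokens, and the sample text are all assumptions made for illustration.

```python
def chunk_by_length(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Cut text into fixed-size pieces; each piece repeats the tail of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    tokens = text.split()  # whitespace words stand in for real tokens
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last piece already reaches the end of the text
    return chunks


# Placeholder text shaped like a rule: heading, conditions, penalty ladder.
rule = (
    "HEADING: pace of play. CONDITIONS: play without unreasonable delay. "
    "PENALTY: first breach, second breach, third breach."
)
pieces = chunk_by_length(rule, chunk_size=10, overlap=2)
for i, piece in enumerate(pieces):
    print(i, "|", piece)
```

With these small numbers the cut lands right after the word "PENALTY:". The heading and conditions stay in the first piece, the ladder goes to the second, and the two-word overlap is the only thing they share.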

 

Note: Of course, I am aware that there are more sophisticated ways to chunk, but I kept it fixed-length for the sake of my experiment. For educational purposes, it worked very well. 

I thought the eval was the easy part

The pipeline came together in an afternoon. PDF in, text out, split into chunks, turn each chunk into numbers that capture its meaning, load them into a store you can search, write a function that takes a question and returns the closest pieces. None of it was hard with Claude Code's guidance. The tools are mature, the tutorials are good, and the work that used to take a team a quarter now takes one person an evening. I finished, almost suspicious of how smoothly it had gone.

I assumed the evaluation would be the quick follow-on. A script you point at the system. Ask it some questions, check the answers, and read off a score. That was my mental model, and I suspect it's the one in most rooms where someone presents a RAG accuracy number.

Then I tried to build the thing the script points at, and the afternoon turned into a week.

The part nobody photographs

To measure retrieval honestly, you need a set of questions and, for each one, the pieces of the knowledge base that actually contain the answer. The questions I could write. I sat down and wrote nineteen, the kind a real golfer asks, in the language a real golfer uses. That took an afternoon.

The other half is where the week went. For each question, I had to go into the knowledge base, decide which chunks were the correct answers, and record them. This is the ground truth. It's the ruler you measure everything else against, and there's no clever shortcut to it. You read, you judge, you tag. A few thousand chunks, nineteen times, with the standard held steady, so question three is judged the same way as question seventeen. And the judgment turned out to be full of decisions nobody writes down, decisions that quietly set the number before a single query has run.
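To make the ruler concrete, here is one possible shape for it, sketched under assumptions: each question is stored with the set of chunk IDs judged to contain its answer, and retrieval is scored with recall at k. The chunk IDs, the retriever output, and the choice of recall as the metric are all hypothetical; the post does not fix a particular metric.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the tagged chunks that show up in the top k retrieved."""
    if not relevant:
        return 0.0
    found = set(retrieved[:k]) & relevant
    return len(found) / len(relevant)


# Hypothetical golden set: questions plus hand-tagged chunk IDs.
golden_set = [
    {"question": "What happens if I play too slowly?",
     "relevant": {"chunk_041", "chunk_042"}},
    {"question": "Can I lift my ball to clean it?",
     "relevant": {"chunk_107"}},
]

# Hypothetical retriever output, in ranked order, for each question.
retrieved_by_question = {
    "What happens if I play too slowly?": ["chunk_041", "chunk_300", "chunk_012"],
    "Can I lift my ball to clean it?": ["chunk_107", "chunk_108", "chunk_019"],
}

scores = [
    recall_at_k(retrieved_by_question[row["question"]], row["relevant"], k=3)
    for row in golden_set
]
mean_recall = sum(scores) / len(scores)
print(scores, mean_recall)
```

The scoring function is a handful of lines. Everything that took the week lives in the `relevant` sets.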

What does "relevant" even mean

Take a plain question: “Can I lift my ball to clean it?” There's a chunk that states the rule. Tag it, done. Except there's a neighbouring chunk holding the exception, the case where you can't, the "only if" a human would absolutely want before giving real advice.

Do I tag only the chunk that says yes? Or also the chunk that says yes, but? A person answering would read both and combine them. A retrieval system that returns only the direct-answer chunk is doing something narrower than a human does. If I tag only that one chunk as correct, I'm rewarding the narrower replies.

There's no right answer here. It depends on what you want the system to be. A fact-surfer that hands back the rule, or a reasoning aid that hands back everything you'd need to reason. The point that landed for me is that this choice, made nineteen times by one tired person, moves the score.

The difference isn't minor. Tagging narrowly, counting only the specific rule chunk as correct, makes a system that also retrieves surrounding context look noisy and score worse. Tagging broadly, including every chunk a human would consider relevant, makes the same system look comprehensive. With the same retrieval process and the same questions, a single labelling decision that nobody else ever sees can significantly move the result. When I quantified how these different construction choices affected the score, the gap was substantial enough to alter the story the dashboard presents. Most of what I discuss in the next post revolves around this point.
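A small worked example of that effect, with made-up chunk IDs and precision at k chosen as the metric purely for illustration. The retrieved list is identical in both runs; only the labels change.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top k retrieved chunks that were tagged as relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for chunk in top if chunk in relevant) / len(top)


# One retrieval result for "Can I lift my ball to clean it?" (hypothetical IDs).
retrieved = ["rule_chunk", "exception_chunk", "unrelated_chunk"]

narrow_labels = {"rule_chunk"}                    # only the direct answer counts
broad_labels = {"rule_chunk", "exception_chunk"}  # the "yes, but" counts too

narrow_score = precision_at_k(retrieved, narrow_labels, k=3)
broad_score = precision_at_k(retrieved, broad_labels, k=3)
print(round(narrow_score, 2), round(broad_score, 2))
```

Under the narrow labels one of three retrieved chunks counts; under the broad labels two of three do. Nothing about the system changed between the two scores.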

The chunk that's a hit and a miss at once

The split I described isn't a freak event; it's a structural feature of cutting documents by length. Headings drift away from the content they head. Penalties get separated from the conditions that trigger them. The word you'd search for and the answer you actually need land on opposite sides of a boundary, joined by one line of overlap that helps less than you'd think.

Ask "what's the penalty?" and the full answer sits in the piece that never names the rule. 

So I'd find a chunk that clearly belonged to a question, and a second chunk beside it holding the half of the answer the first one was missing. Tag one, and the system can appear to have succeeded while returning something incomplete. Tag both, and you've quietly redefined success again. The retrieval system, meanwhile, has no idea any of this is happening. It finds the words that match and returns them, indifferent to whether the rule it grabbed is whole.
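The gap between "found something" and "found the whole answer" can be written down as two different success checks. This is a sketch with hypothetical chunk IDs and function names; neither check is claimed to be the metric used in this project.

```python
def any_hit(retrieved: list[str], tagged: set[str], k: int) -> bool:
    """Success if at least one tagged chunk appears in the top k."""
    return any(chunk in tagged for chunk in retrieved[:k])


def complete_hit(retrieved: list[str], required: set[str], k: int) -> bool:
    """Success only if every chunk needed for the full answer is in the top k."""
    return required <= set(retrieved[:k])


# The rule was split in two: both halves are needed for a whole answer.
required = {"heading_and_conditions", "penalty_ladder"}

# Hypothetical retrieval: the heading chunk matched the words, the penalty chunk did not.
retrieved = ["heading_and_conditions", "unrelated_a", "unrelated_b"]

looks_successful = any_hit(retrieved, required, k=3)
is_complete = complete_hit(retrieved, required, k=3)
print(looks_successful, is_complete)
```

The same retrieval passes the first check and fails the second, which is the sense in which one chunk can be a hit and a miss at once.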

Sitting with the knowledge base, chunk by chunk, did something no dashboard could. It showed me my own pipeline. I found boundaries that cut rules in clumsy places. I found chunks that looked relevant to a question, shared its vocabulary, and answered nothing. I found answers that were incomplete without the chunk beside them, which meant my 500-token setting had been making decisions about meaning I'd never consciously approved. I learned more about what was wrong with my retrieval by labelling the test set than I would have from any score the test set could have produced.

 

Building the ruler was itself the inspection. And it's exactly the step teams skip, because it's slow and tempting to automate. You can have a model generate your test questions and even guess at the answers, and it'll hand you a tidy dataset in minutes. But you'll have skipped the week when you actually look at your own data. You'll have a number and no relationship to what's inside it.

 

Why I am doing this on my own

A RAG accuracy figure isn't a direct measurement, the way a thermometer reading is. It reflects choices made while constructing it: what is considered relevant, whether partial answers count, how documents are sliced, and which questions are included. Changing any of these choices alters the score, even if the underlying system remains unchanged. If you made those choices, you know them; if not, the number appears clean and final but offers no insight into the decisions behind it.

My goal wasn't to learn this, but to obtain a number. What I gained instead was a list of questions I now ask anyone presenting me with one.

For example, how do you determine what counts as a correct answer? Do the test questions come from your own documents or from an external source? Show me five instances where the system marked answers as correct, and let me see what it actually retrieved.

I believe doing the work yourself is valuable because it helps you truly understand the process. Teaching forces you to articulate what you know and reveals gaps in your understanding. Building the system does the same, pushing you to defend the number you generate before others see it. As more tasks are automated with AI, this aspect remains vital. An AI can build pipelines, run evaluations, and format results, but it can't judge whether those results are trustworthy. That judgment belongs to a person who has been directly involved in creating the number.

There is no better way to discover what influences the quality of RAG solutions than by building one yourself. The arrival of Claude Code and friends has made this option available to virtually every product manager, not only those who used to be software engineers. 

I'm still out on the course with my RAG Golf Experiment. Whether the Rule 5.6 split actually cost me at retrieval, or whether the overlap saved it after all, is the topic of the post I'm writing next. 

Wojciech Pozarzycki, June 2026
