RAG from scratch series - part 1


In this blog series, we are going to build a RAG system from scratch. The main goal is to get familiar with the foundational concepts and the pieces we need to put together to build such a system.

One of the best ways to learn a complex topic is to break it down into pieces and assemble them into a functional system, and that's exactly what we are going to do.

While there are many libraries and ready-made solutions that achieve the same thing, they also hide the underlying details, which makes it difficult to learn the important concepts. That's why we will keep our system pretty much barebones.


Refresher - what is RAG?

From our beginner's glossary: RAG stands for Retrieval Augmented Generation. The name might sound complicated, but the goal of RAG systems is simple - they are a way to supplement a large language model with additional information. Why would anyone want to do that?

Here is the deal - large language models (commonly known as LLMs) are trained on massive amounts of data, for example all Wikipedia articles. This means that an LLM "knows" this information, which is great because we can ask questions about it or instruct the model to do something with that data.

However, there is a downside - what about the things an LLM doesn't know? There is a lot of data out there that an LLM wasn't trained on, for example your emails, your insurance policies, company HR guidelines, etc.

This is where RAG comes in - it's a powerful way to feed all this data into an LLM while still leveraging all the LLM capabilities such as summarization, question answering, etc. While RAG sounds technical, a friendlier term has emerged in business circles - "Ask your documents". Hopefully, this gives you an idea of what these systems can achieve.


What are we going to build?

Gitlab is a well-known company that offers a DevOps platform used by many software development teams around the world. Besides their great product, they are also famous for their company culture. Since they are a fully remote company, transparency and access to information for all employees are essential, so Gitlab decided to make their company policies and processes publicly available in the Gitlab Handbook.

The handbook includes documents on topics such as onboarding/offboarding of employees, how to invoice, how Gitlab does product marketing, etc. Exactly what we need for our RAG system!

At the end of this series, our system should be able to perform the following (sketched in code right after this list):

  1. We ask a question about Gitlab policies e.g. how do I offboard an employee?
  2. Our system will find 2-3 documents that are relevant to the offboarding process (Retrieval)
  3. Then it will combine the question with the text of these 2-3 documents and create one big prompt (Augmentation)
  4. The large language model will respond to our question (Generation)
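To make these three steps a bit more concrete, here is a rough sketch of the pipeline we are working towards. This is purely illustrative - the function names are placeholders, not code from the repository.

def answer_question(question):
    # Retrieval: find the handful of chunks most similar to the question
    relevant_chunks = find_relevant_chunks(question, top_k=3)

    # Augmentation: combine the question and the chunk texts into one big prompt
    prompt = build_prompt(question, relevant_chunks)

    # Generation: let the large language model answer using the supplied context
    return call_llm(prompt)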

Code and project setup

All the code is located in my Github repository. Please follow the instructions in the Readme.MD to get everything up and running.

Our Gitlab documents are located inside the handbook folder. They are in Markdown format, and each folder represents a category. There are about 130 documents, which is a much smaller set than the original handbook - for cost and speed reasons I've selected only a subset of the documents. However, this is already more than enough to build a RAG system. Let's dive in.


Getting chunky

Chunking is a process in which we break down a bigger document into smaller pieces. There are many chunking strategies, and for our system we are going to stick to simple fixed-size chunks. In other words, we are going to break down the text into smaller pieces of 750 characters each.
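To see what fixed-size chunking produces, here is a toy example (not code from the repository) using Python's textwrap module, which splits a long text into pieces of at most a given number of characters, breaking on whitespace:

import textwrap

sample_text = "handbook " * 500  # a few thousand characters of dummy text
chunks = textwrap.wrap(sample_text, 750)

print(len(chunks))                       # number of chunks produced
print([len(chunk) for chunk in chunks])  # each chunk is at most 750 characters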

You are also probably wondering - why do we even have to break down the document into smaller parts? Hold that thought - we will get to the answer at a later stage.

Choosing a chunking strategy has a direct impact on the performance of your system and the quality of the answers you get. Chunking techniques range from simple ones to more complex approaches such as semantic chunking, but the important part is that there is no one-size-fits-all approach. It's an iterative process that highly depends on your documents, so keep exploring until you get the best results.

Inside the chunk script, we see a couple of things happening. First off, we read all markdown files from the handbook folder, then iterate through each one. If we look at one of the documents, we can see that there is a title and a description at the top.

This information will help the LLM find the right document later, so we want to store it with each chunk. This is the job of the extract_document_metadata function. Lastly, we break the document down into chunks of CHUNK_SIZE characters and store them in JSON.

import textwrap

CHUNK_SIZE = 750  # fixed chunk size in characters

def chunk_documents():
    documents = gather_handbook_documents()

    for document in documents:
        with open(document) as d:
            document_text = d.read()
            # Pull the title and description out of the document so we can attach them to every chunk
            title, description, remaining_text = extract_document_metadata(document_text)

            # Split the remaining text into fixed-size chunks and write each one to its own file
            for chunk_index, chunk in enumerate(textwrap.wrap(remaining_text, CHUNK_SIZE), start=1):
                create_file_for_each_chunk(title, description, document, chunk_index, chunk)
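The actual extract_document_metadata implementation lives in the repository, but to give you an idea, assuming each handbook page starts with YAML-style front matter containing a title and a description, a simplified version could look something like this:

# Simplified, hypothetical sketch - the real helper is in the repository.
# It assumes each page starts with front matter such as:
#
#   ---
#   title: CEO
#   description: This page details processes specific to Sid, CEO of GitLab.
#   ---
#
def extract_document_metadata(document_text):
    title, description = '', ''
    front_matter, _, remaining_text = document_text.partition('\n---\n')

    for line in front_matter.splitlines():
        if line.startswith('title:'):
            title = line.split(':', 1)[1].strip()
        elif line.startswith('description:'):
            description = line.split(':', 1)[1].strip()

    return title, description, remaining_text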

Run the following command: poetry run python3 chunk.py. Inside the chunks folder, you'll find files like the one below. Congrats, you've successfully chunked the documents!💪


{
    "id": "710014b9-11fb-4754-9611-d6bbd49d85d4",
    "title": "CEO",
    "description": "This page details processes specific to Sid, CEO of GitLab.",
    "document": "handbook/ceo/_index.md",
    "chunk_text": "---  ## Intro  This page details processes specific to Sid, CEO of GitLab. The page is intended to be helpful, feel free to deviate from it and update this page if you think it makes sense. If there are things that might seem pretentious or overbearing please raise them so we can remove or adapt them. Many items on this page are a guidelines for our [Executive Business Administrators](/job-families/people-group/executive-business-administrator/) (EBAs).  ### CEO Bio  Sid Sijbrandij is the Co-founder, Chief Executive Officer and Board Chair of GitLab Inc., the most comprehensive AI-powered DevSecOps platform. GitLab\u2019s single application helps organizations deliver software faster and more efficiently while strengthening their security and",
    "chunk_token_count": 153
}
        

A side note on the chunk_token_count field: tokens are the currency of the GenAI world because we get charged by the number of tokens we send to and receive from an LLM. Our system is small and the token amount is negligible, so we will not use this field, but for larger systems it's good to have it around to track costs.
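If you are curious how a number like chunk_token_count can be computed, one common option is the tiktoken library (used here purely for illustration - it may not be what the repository uses):

import tiktoken

# cl100k_base is the encoding used by many of OpenAI's embedding and chat models
encoding = tiktoken.get_encoding('cl100k_base')

chunk_text = 'This page details processes specific to Sid, CEO of GitLab.'
chunk_token_count = len(encoding.encode(chunk_text))
print(chunk_token_count)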


From chunks to embeddings

Now that we have lots of smaller chunks, we also need a way to efficiently search them during the retrieval phase. Remember, our system is going to get a question about any of the documents, so how can we find the chunks that are relevant to the user's question? The answer is embeddings.

Embeddings are a way to convert text into a long list of numbers. These numbers are special because they don't only capture the words themselves but also their meanings and the relationships between them. The numbers can be thought of as points in a multidimensional space. If you are a visual person, check out the Embedding Projector to see what this looks like.

In our example, chunks that are related to interviewing at Gitlab will be placed close to each other, chunks that are related to leadership will likewise sit close together, and so on. All large language models are built on top of embeddings - it's one of the components that makes them so powerful.

Embeddings are also referred to as vectors.
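To get a feel for what "close to each other" means, the similarity between two embeddings is often measured with cosine similarity. Here is a minimal, made-up example - the vectors are tiny on purpose, real embeddings have hundreds or thousands of dimensions:

import math

def cosine_similarity(a, b):
    # Close to 1.0 means the vectors point in a similar direction (related meaning), close to 0.0 means unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up vectors just to show the mechanics
interviewing_chunk = [0.9, 0.1, 0.0]
hiring_chunk = [0.8, 0.2, 0.1]
invoicing_chunk = [0.0, 0.1, 0.9]

print(cosine_similarity(interviewing_chunk, hiring_chunk))     # high -> related topics
print(cosine_similarity(interviewing_chunk, invoicing_chunk))  # low -> unrelated topics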

Now it's time to generate embeddings from our chunks and see what they look like in our example project. The code is located in the embeddings script. For each chunk, we call the OpenAI Embeddings API and pass chunk_text to it. We then store the returned embedding in a new embeddings field and write everything back to the same JSON file.

Heads up before you run this script! There are around 3000 chunks, so it will take around 15-20 minutes to complete - grab some coffee.☕

import json
import os
import time

from openai import OpenAI

# Assumes the OpenAI Python SDK v1 client; the API key is read from the OPENAI_API_KEY environment variable
openai_api_client = OpenAI()

chunk_files = gather_chunk_files()

for index, chunk_file in enumerate(chunk_files, start=1):
    # Read the chunk, request its embedding, then write it back to the same file
    with open(chunk_file) as c:
        chunk_data = json.load(c)

    response = openai_api_client.embeddings.create(input=chunk_data['chunk_text'], model=os.environ.get('EMBEDDING_MODEL'))
    chunk_data['embeddings'] = response.data[0].embedding

    with open(chunk_file, 'w') as c:
        json.dump(chunk_data, c, indent=4)

    time.sleep(0.1)  # small pause so we stay well under the API rate limits
    print(f'Processed chunks -> {index}/{len(chunk_files)}')

Run poetry run python3 embeddings.py. After it completes, open any chunk file inside the chunks folder - you should see a new field called embeddings, which is a very, very long list of numbers.
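As a quick sanity check, you can also inspect one of the chunk files from code (this snippet assumes the chunks are stored as individual .json files in the chunks folder, as above). The exact number of dimensions depends on the embedding model you configured:

import glob
import json

# Grab any chunk file from the chunks folder
chunk_file = glob.glob('chunks/*.json')[0]

with open(chunk_file) as f:
    chunk_data = json.load(f)

print(len(chunk_data['embeddings']))   # e.g. 1536 dimensions for OpenAI's smaller embedding models
print(chunk_data['embeddings'][:5])    # the first few numbers of the vector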

In a real project, both the chunks and the embeddings would be stored in a database, or in specialized databases such as vector databases. The goal of this project is to learn the ropes, so we are intentionally keeping things simple, but it's also important to keep in mind that our project can be massively improved - something we are going to tackle in future blog posts.


Part 1 - complete

Phew, that was a long post, but no one said it was going to be simple, right?😉 We did achieve something big - our data is now ready to be consumed, and next we get to build the fun parts. We also learned some new concepts such as chunking and embeddings, and in the next part we will use them to build a functional system.

Stay tuned for the second part of this series. If you have questions, please reach out via Linkedin, or open a Github issue if you have problems running the repository.