Retrieval-augmented generation, step by step

By Nitin Borwankar

Typically, the use of large language models (LLMs) in the enterprise falls into two broad categories. The first one is where the LLM automates a language-related task such as writing a blog post, drafting an email, or improving the grammar or tone of an email you have already drafted. Most of the time these sorts of tasks do not involve confidential company information.

The second category involves processing internal company information, such as a collection of documents (PDFs, spreadsheets, presentations, etc.) that need to be analyzed, summarized, queried, or otherwise used in a language-driven task. Such tasks include asking detailed questions about the implications of a clause in a contract, for example, or creating a visualization of sales projections for an upcoming project launch.

There are two reasons why using a publicly available LLM such as ChatGPT might not be appropriate for processing internal documents. Confidentiality is the first and obvious one. The second, equally important reason is that the training data of a public LLM did not include your internal company information, so the LLM is unlikely to give useful answers when asked about that information.

Enter retrieval-augmented generation, or RAG. RAG is a technique used to augment an LLM with external data, such as your company documents, that provide the model with the knowledge and context it needs to produce accurate and useful output for your specific use case. RAG is a pragmatic and effective approach to using LLMs in the enterprise.

In this article, I’ll briefly explain how RAG works, list some examples of how RAG is being used, and provide a code example for setting up a simple RAG framework.

How retrieval-augmented generation works

As the name suggests, RAG consists of two parts—one retrieval, the other generation. But that doesn’t clarify much. It’s more useful to think of RAG as a four-step process. The first step is done once, and the other three steps are done as many times as needed.

The four steps of retrieval-augmented generation:

1. Ingest: extract the raw text from the documents, split it into chunks, convert the chunks into vector embeddings, and store the embeddings in a vector database. This step is done once per document collection.
2. Retrieve: when a user asks a question, convert the question into an embedding and use similarity search to find the most relevant chunks in the vector database.
3. Augment: combine the user's question with the retrieved chunks to form an expanded prompt that gives the LLM the context it needs.
4. Generate: send the augmented prompt to the LLM, which produces an answer grounded in the retrieved content.

By focusing the LLM on the document corpus, RAG helps to ensure that the model produces relevant and accurate answers. At the same time, RAG helps to prevent arbitrary or nonsensical answers, which are commonly referred to in the literature as “hallucinations.”
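
To make the four steps concrete, here is a toy, self-contained sketch in Python. The "embeddings" here are just word sets and similarity is simple word overlap; a real system would use an embedding model, a vector database, and an LLM for these parts, as in the full example later in this article.

# Toy illustration of the four RAG steps. Word sets stand in for embeddings
# and word overlap stands in for vector similarity.

def embed(text):
    return set(text.lower().split())          # stand-in for a real embedding model

def similarity(a, b):
    return len(a & b) / (len(a | b) or 1)     # word overlap as a stand-in score

class ToyVectorStore:
    def __init__(self):
        self.entries = []
    def add(self, vector, chunk):             # step 1: ingest, done once
        self.entries.append((vector, chunk))
    def search(self, vector, top_k=2):        # step 2: retrieve by similarity
        ranked = sorted(self.entries,
                        key=lambda entry: similarity(vector, entry[0]),
                        reverse=True)
        return [chunk for _, chunk in ranked[:top_k]]

store = ToyVectorStore()
for chunk in ["NATO was founded in 1949.",
              "The speech covers jobs and inflation."]:
    store.add(embed(chunk), chunk)

question = "When was NATO founded?"
context = store.search(embed(question), top_k=1)
augmented_prompt = f"Context: {context}\nQuestion: {question}"   # step 3: augment
print(augmented_prompt)   # step 4 would send this augmented prompt to the LLM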

From the user perspective, retrieval-augmented generation will seem no different than asking a question to any LLM with a chat interface—except that the system will know much more about the content in question and will give better answers.

The RAG process from the point of view of the user:

1. The user types a question into a chat interface.
2. Behind the scenes, the system retrieves the relevant chunks and augments the question with them.
3. The answer, grounded in the document content, appears in the chat.

Use cases for retrieval-augmented generation

The use cases for RAG are varied and growing rapidly. These are just a few examples of how and where RAG is being used.

Search engines

Search engines have implemented RAG to provide more accurate and up-to-date featured snippets in their search results. Any application of LLMs that must keep up with constantly updated information is a good candidate for RAG.

Question-answering systems

RAG has been used to improve the quality of responses in question-answering systems. The retrieval-based model finds relevant passages or documents containing the answer (using similarity search), then generates a concise and relevant response based on that information.
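
Under the hood, "finding relevant passages" usually means ranking stored chunks by the cosine similarity between the question's embedding and each chunk's embedding. Here is a toy illustration using NumPy for the arithmetic, with made-up three-dimensional vectors; real embeddings have hundreds or thousands of dimensions.

import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = [0.10, 0.80, 0.40]       # embedding of the user's question
passage_a = [0.12, 0.85, 0.33]       # embedding of a relevant passage
passage_b = [0.90, 0.05, 0.10]       # embedding of an unrelated passage

print(cosine_similarity(query_vec, passage_a))  # close to 1.0: a good match
print(cosine_similarity(query_vec, passage_b))  # much lower: a poor match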

E-commerce

RAG can be used to enhance the user experience in e-commerce by providing more relevant and personalized product recommendations. By retrieving and incorporating information about user preferences and product details, RAG can generate more accurate and helpful recommendations for customers.

Healthcare

RAG has great potential in the healthcare industry, where access to accurate and timely information is crucial. By retrieving and incorporating relevant medical knowledge from external sources, RAG can assist in providing more accurate and context-aware responses in healthcare applications. Such applications augment the information accessible to a human clinician, who ultimately makes the call, not the model.

Legal

RAG can be applied powerfully in legal scenarios, such as M&A, where complex legal documents provide context for queries, allowing rapid navigation through a maze of regulatory issues.

Introducing tokens and embeddings

Before we dive into our code example, we need to take a closer look at the document ingestion process. To be able to ingest documents into a vector database for use in RAG, we need to pre-process them as follows:

1. Extract the raw text from each document.
2. Split the text into chunks.
3. Convert the chunks into vector embeddings.
4. Store the embeddings in a vector database.

What does this mean?

A document may be PDF or HTML or some other format, and we don’t care about the markup or the format. All we want is the content—the raw text.

After extracting the text, we need to divide it into chunks. An embedding model then breaks each chunk into tokens (small units of text, roughly words or word fragments) and maps the chunk to a high-dimensional vector of floating-point numbers, typically 768 or 1,024 dimensions or even larger. These vectors are called embeddings, because we are embedding a numerical representation of a chunk of text into a vector space.
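
For instance, LangChain's CharacterTextSplitter, the same splitter we use in the full example below, splits extracted text on paragraph boundaries, subject to a maximum chunk size. A small, self-contained illustration:

from langchain.text_splitter import CharacterTextSplitter

raw_text = ("The first paragraph of the extracted document text.\n\n"
            "The second paragraph, and so on through the document.")

splitter = CharacterTextSplitter(chunk_size=60, chunk_overlap=0)
chunks = splitter.split_text(raw_text)
print(len(chunks))   # 2: each paragraph became its own chunk
print(chunks[0])     # the first chunk of raw text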

There are many ways to convert text into vector embeddings. Usually this is done using a tool called an embedding model, which can be an LLM or a standalone encoder model. In our RAG example below, we’ll use OpenAI’s embedding model.
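
Here's what that looks like in practice with LangChain and OpenAI's hosted embedding model. Note that this call requires an OpenAI API key (set via the OPENAI_API_KEY environment variable) and sends the text to OpenAI.

from langchain_openai.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()   # OpenAI's default embedding model
vector = embeddings.embed_query("NATO was founded in 1949.")
print(len(vector))    # dimensionality of the embedding, e.g. 1536 for the default model
print(vector[:5])     # the first few floating-point components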

A note about LangChain

LangChain is a framework for Python and TypeScript/JavaScript that makes it easier to build applications that are powered by language models. Essentially, LangChain allows you to chain together agents or tasks to interact with models, connect with data sources (including vector data stores), and work with your data and model responses.
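
As a minimal taste of the chaining idea, the sketch below pipes a prompt template into a chat model. It assumes an OPENAI_API_KEY is set in the environment.

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template("Answer in one sentence: {question}")
llm = ChatOpenAI(temperature=0)

# The | operator composes the two steps into a single runnable chain
chain = prompt | llm
print(chain.invoke({"question": "What is a vector database?"}).content)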

LangChain is very useful for jumping into LLM exploration, but it is changing rapidly. As a result, it takes some effort to keep all the libraries in sync, especially if your application has a lot of moving parts with different Python libraries in different stages of evolution. A newer framework, LlamaIndex, has also emerged. LlamaIndex was designed specifically for LLM data applications, so it has more of an enterprise bent.

Both LangChain and LlamaIndex have extensive libraries for ingesting, parsing, and extracting data from a vast array of data sources, from text, PDFs, and email to messaging systems and databases. Using these libraries takes the pain out of parsing each different data type and extracting the content from the formatting. That itself is worth the price of entry.

A simple RAG example

We will build a simple “Hello World” RAG application using Python, LangChain, and an OpenAI chat model. Combining the linguistic power of an LLM with the domain knowledge of a single document, our little app will allow us to ask the model questions in English, and it will answer our questions by referring to content in our document.

For our document, we’ll use the text of President Biden’s February 7, 2023, State of the Union Address. If you want to try this at home, save the text of the speech as stateOfTheUnion2023.txt in your working directory.


A production-grade version of this app would allow private collections of documents (Word docs, PDFs, etc.) to be queried with English questions. Here we are building a simple system that does not preserve privacy, as it sends the document to a public model. Please don’t run this app using private documents.

We will use the hosted embedding and language models from OpenAI, and the open-source FAISS (Facebook AI Similarity Search) library as our vector store, to demonstrate a RAG application end to end with the least possible effort. In a subsequent article we will build a second simple example using a fully local LLM with no data sent outside the app. Using a local model involves more work and more moving parts, so it is not the ideal first example.

To build our simple RAG system we need the following components:

- The document to be queried (here, the text of the speech, saved as stateOfTheUnion2023.txt)
- An embedding model (we’ll use OpenAI’s hosted embedding model)
- A vector store (we’ll use the open-source FAISS library)
- A chat LLM (we’ll use an OpenAI chat model)
- LangChain, to tie all of these pieces together
- An OpenAI API key, so the code can call the hosted models

The preparatory steps:

pip install -U langchain
pip install -U langchain_community
pip install -U langchain_openai

The source code for our RAG system:

# We start by loading the text of President Biden's 2023 State of the Union Address
from langchain_community.document_loaders import TextLoader
loader = TextLoader('./stateOfTheUnion2023.txt')

from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai.embeddings import OpenAIEmbeddings

import os
os.environ["OPENAI_API_KEY"] = "<you will need to get an API key from OpenAI>"

# We load the document using LangChain's handy extractors, formatters, loaders, embeddings, and LLMs
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# We use an OpenAI default embedding model
# Note the code in this example does not preserve privacy
embeddings = OpenAIEmbeddings()

# LangChain provides API functions to interact with FAISS
db = FAISS.from_documents(texts, embeddings)

# We create a 'retriever' that knows how to interact with our vector database using an augmented context
# We could construct the retriever ourselves from first principles, but it's tedious
# Instead we'll use LangChain to create a retriever for our vector database
retriever = db.as_retriever()

from langchain.agents.agent_toolkits import create_retriever_tool
tool = create_retriever_tool(
    retriever,
    "search_state_of_union",
    "Searches and returns documents regarding the state-of-the-union."
)
tools = [tool]

# We wrap an LLM (here OpenAI) with a conversational interface that can process augmented requests
from langchain.agents.agent_toolkits import create_conversational_retrieval_agent

# LangChain provides an API to interact with chat models
from langchain_openai.chat_models import ChatOpenAI
llm = ChatOpenAI(temperature=0)
agent_executor = create_conversational_retrieval_agent(llm, tools, verbose=True)

query = "what is NATO?"
result = agent_executor.invoke({"input": query})
# Response from the model

query = "When was it created?"
result = agent_executor.invoke({"input": query})
# Response from the model

When we run this script, the model’s response to our first question is quite accurate:

NATO stands for the North Atlantic Treaty Organization. It is an intergovernmental military alliance formed in 1949. NATO's primary purpose is to ensure the collective defense of its member countries. It is composed of 30 member countries, mostly from North America and Europe. The organization promotes democratic values, cooperation, and security among its members. NATO also plays a crucial role in crisis management and peacekeeping operations around the world.

Finished chain.

And the model’s response to the second question is exactly right:

NATO was created on April 4, 1949.

Finished chain.

As we’ve seen, the use of a framework like LangChain greatly simplifies our first steps into LLM applications. LangChain is strongly recommended if you’re just starting out and you want to try some toy examples. It will help you get right to the meat of retrieval-augmented generation, meaning the document ingestion and the interactions between the vector database and the LLM, rather than getting stuck in the plumbing.

For scaling to a larger corpus and deploying a production application, a deeper dive into local LLMs, vector databases, and embeddings will be needed. Naturally, production deployments will involve much more nuance and customization, but the same principles apply. We will explore local LLMs, vector databases, and embeddings in more detail in future articles here.
