
Retrieval-Augmented Generation (RAG) use is on the rise. Here are 10 best practices for building RAG systems based on real-world experiences.
Generative AI applications are becoming increasingly commonplace across enterprises. According to a recent 451 Research survey, 71% of the enterprises that responded have adopted or are in the process of adopting generative AI in their organization. These applications rely on large language models (LLMs) such as the GPT, Gemini, and Llama families of models as their backbone. However, there are significant drawbacks to using these standard LLMs out of the box for enterprise applications. First, the quality of their responses depends on the quality of their training data. Second, they carry a higher risk of hallucination. Finally, they cannot provide contextual answers grounded in an organization's own data, which is critical for enterprise users.
One strategy to overcome these challenges is the use of Retrieval-Augmented Generation (RAG) systems. At a high level, RAG systems combine information retrieval and search techniques with LLMs to provide more contextual, grounded answers for users. Figure 1 shows a high-level architecture for a basic RAG system.
Figure 1. The RAG System Architecture
The input to the document processing module is the corpus of data from which user queries should be answered. Every document in the corpus is split into chunks by the chunking module; these chunks are then converted into vectors/embeddings using an embedding model and stored in a vector database. Pinecone and Milvus are popular choices for these vector databases. When a user submits a query, the query converter turns it into a vector, which is passed on to the retriever module. The retriever compares the query vector with the vectors in the vector database to retrieve the chunks most relevant to the user query. These chunks, along with the query, are passed on to the answer generator, which typically uses a standard LLM such as GPT-4 or an open-source model. The answer generator uses only the relevant chunks to answer the user query. The answer is then post-processed to check for hallucinations, provide explainability for the generated answer, and perform other validations. The final processed answer is displayed to the user.
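The flow in Figure 1 can be illustrated with a minimal sketch. The embed() and call_llm() functions below are placeholders for whichever embedding model and LLM a team chooses, and the in-memory list stands in for a vector database such as Pinecone or Milvus; this is a sketch of the architecture, not a production implementation.

```python
import numpy as np

# Placeholder: replace with a real embedding model client.
def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

# Placeholder: replace with a call to GPT-4 or an open-source LLM.
def call_llm(prompt: str) -> str:
    return f"[LLM answer based on a prompt of {len(prompt)} characters]"

def chunk(document: str, chunk_size: int = 500) -> list[str]:
    # Naive fixed-size character chunking; see Best Practice 2 for alternatives.
    return [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Document processing: chunk the corpus and store (chunk, vector) pairs.
corpus = ["First source document ...", "Second source document ..."]
index = [(c, embed(c)) for doc in corpus for c in chunk(doc)]

# Query time: convert the query, retrieve the most relevant chunks, generate.
def answer(query: str, top_k: int = 3) -> str:
    q_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    context = "\n\n".join(c for c, _ in ranked[:top_k])
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return call_llm(prompt)

print(answer("What does the first document describe?"))
```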
Over the last year, we have built a number of RAG systems, and in the rest of the article, we will provide the top 10 best practices based on our experience.
Best Practice 1: Consider query classification before running RAG
A RAG system is equipped only to answer questions from the document corpus that provides the context for its answers. Depending on the use case, users may not know what the corpus covers and cannot be expected to limit their questions to it. If a question falls outside the coverage of the corpus, the result may be a hallucinated answer. One way of mitigating this is to enhance the prompt, instructing the LLM to answer only if the answer is available in the context. However, this may not be sufficient. An additional module to consider is a query classifier trained to decide whether a query is meant to be answered by retrieval from the document corpus at all. Metadata about the corpus and logs of questions answered successfully in the past can be used to train the classifier so that it can also infer whether this corpus can answer the query.
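As a minimal sketch of such a gate, the snippet below compares the incoming query against a description of the corpus and against previously answered questions, routing it to the RAG pipeline only if it clears a similarity threshold. The embed() function and the threshold value are assumptions; in practice, a classifier trained on corpus metadata and query logs would replace this heuristic.

```python
import numpy as np

# Placeholder: replace with the embedding model used elsewhere in the pipeline.
def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Metadata about the corpus plus questions answered successfully in the past.
corpus_description = "HR policies: leave, benefits, payroll, onboarding."
answered_queries = ["How many vacation days do I get?", "When is payroll run?"]
references = [embed(corpus_description)] + [embed(q) for q in answered_queries]

def should_run_rag(query: str, threshold: float = 0.3) -> bool:
    """Return True if the query looks answerable from this corpus."""
    q_vec = embed(query)
    best = max(cosine(q_vec, ref) for ref in references)
    return best >= threshold  # otherwise, decline or route elsewhere

if should_run_rag("How do I enroll in the dental plan?"):
    print("Route to RAG pipeline")
else:
    print("Out of scope for this corpus")
```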
Best Practice 2: Chunk size and method matters
Chunking is an important component of the document processing module. It breaks the document down into manageable segments, called chunks, that are fed into the vector database. Chunk size plays a significant role in the quality of the final answer presented to the user: larger chunks provide better context, while smaller chunks match the user query more precisely. One way to combine both advantages is to create both small and large chunks, using the small chunks for matching and retrieval and the larger chunks for answer synthesis. This method is called small-to-big chunking and retrieval. There are also multiple ways to chunk a document, including recursive character chunking, character chunking, and token chunking. In our experiments, token chunking has worked best.
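A small-to-big setup with token chunking can be sketched as follows. The token sizes and the use of tiktoken's cl100k_base encoding are assumptions; any tokenizer that matches the embedding model and LLM in use can be substituted.

```python
import tiktoken  # assumed tokenizer; swap in the tokenizer matching your models

enc = tiktoken.get_encoding("cl100k_base")

def token_chunks(text: str, max_tokens: int) -> list[str]:
    """Token-based chunking: split on token boundaries rather than characters."""
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

def small_to_big(document: str, big_tokens: int = 512, small_tokens: int = 128):
    """Return (small_chunk, parent_big_chunk) pairs.

    Small chunks are embedded and matched at retrieval time; the parent big
    chunk is what gets passed to the LLM for answer synthesis.
    """
    pairs = []
    for big in token_chunks(document, big_tokens):
        for small in token_chunks(big, small_tokens):
            pairs.append((small, big))
    return pairs

pairs = small_to_big("Long enterprise document text ... " * 200)
print(f"{len(pairs)} small chunks, each linked to its parent chunk")
```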
Best Practice 3: The choice of the embedding model
The chunks generated by the chunking module are converted into embeddings that are then stored in a vector database. The quality of these embeddings is critical for semantically matching the user's query. There are hundreds of embedding models available, and the Massive Text Embedding Benchmark (MTEB) leaderboard provides a great starting point. A robust approach is to take the top five embedding models on the leaderboard and experiment to determine which is most effective for the system being built.
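One hedged way to run that experiment is shown below, using sentence-transformers as an example framework. The model names and the tiny evaluation set are placeholders; in practice, the candidates would be drawn from the current MTEB leaderboard and the evaluation set from real user queries with known relevant chunks.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed framework

# Placeholder candidates; in practice, take the top models from the MTEB leaderboard.
candidates = ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]

# A tiny labeled evaluation set: each query maps to the index of its relevant chunk.
chunks = ["Employees accrue 20 vacation days per year.",
          "Payroll is processed on the last business day of the month."]
queries = [("How much vacation do I get?", 0),
           ("When do I get paid?", 1)]

def recall_at_1(model_name: str) -> float:
    model = SentenceTransformer(model_name)
    chunk_vecs = model.encode(chunks, normalize_embeddings=True)
    hits = 0
    for query, gold in queries:
        q_vec = model.encode([query], normalize_embeddings=True)[0]
        best = int(np.argmax(chunk_vecs @ q_vec))  # cosine similarity on unit vectors
        hits += int(best == gold)
    return hits / len(queries)

for name in candidates:
    print(name, recall_at_1(name))
```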
Best Practice 4: The vector database
There are a number of open-source vector databases, such as Weaviate, Chroma, Qdrant, Milvus, and Faiss. They can be evaluated on various metrics, which will vary with corpus size, number of users, and other factors. Qualitative criteria include support for different index types, scalability, and cloud support. In our evaluations, the choice of vector database did not significantly affect answer quality; Milvus is a good choice to start with.
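Because the database choice matters less than retrieval quality, it helps to hide the database behind a thin interface so back ends can be swapped during evaluation. The sketch below uses our own class and method names (not any database's actual API) and an in-memory NumPy reference implementation; a Milvus, Qdrant, or Weaviate adapter would expose the same two methods.

```python
from typing import Protocol
import numpy as np

class VectorStore(Protocol):
    """Minimal interface the rest of the RAG system depends on."""
    def upsert(self, ids: list[str], vectors: np.ndarray, payloads: list[dict]) -> None: ...
    def search(self, query: np.ndarray, top_k: int) -> list[dict]: ...

class InMemoryStore:
    """Reference implementation for experiments; not meant for production scale."""
    def __init__(self) -> None:
        self._ids: list[str] = []
        self._vectors: list[np.ndarray] = []
        self._payloads: list[dict] = []

    def upsert(self, ids, vectors, payloads) -> None:
        self._ids.extend(ids)
        self._vectors.extend(np.asarray(v, dtype=float) for v in vectors)
        self._payloads.extend(payloads)

    def search(self, query, top_k):
        matrix = np.stack(self._vectors)
        sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
        order = np.argsort(sims)[::-1][:top_k]
        return [{"id": self._ids[i], "score": float(sims[i]), **self._payloads[i]}
                for i in order]

store: VectorStore = InMemoryStore()
store.upsert(["c1", "c2"], np.random.rand(2, 8),
             [{"text": "chunk one"}, {"text": "chunk two"}])
print(store.search(np.random.rand(8), top_k=1))
```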
Best Practice 5: Prompts are tied to the LLM
Prompt engineering is crucial to ensure that the LLM provides accurate and effective responses. A common mistake is to swap out the LLM while keeping the prompt the same, which reduces the effectiveness of the overall system. Whenever the LLM changes, it is important to iteratively test and adjust the prompts.
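One lightweight way to enforce this discipline is to key prompt templates by model name, so that switching models without writing and testing a new prompt fails loudly. The template wording and model names below are illustrative only.

```python
# Prompt templates keyed by model; the wording here is illustrative, not prescriptive.
PROMPTS = {
    "gpt-4": (
        "You are an enterprise assistant. Answer strictly from the context.\n"
        "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    ),
    "llama-3-70b-instruct": (
        "Use only the context to answer. If the answer is not in the context, "
        "say you do not know.\nContext:\n{context}\nQuestion: {question}\n"
    ),
}

def build_prompt(model_name: str, context: str, question: str) -> str:
    try:
        template = PROMPTS[model_name]
    except KeyError:
        # Force the team to write and test a prompt before switching models.
        raise ValueError(f"No tested prompt registered for model '{model_name}'")
    return template.format(context=context, question=question)

print(build_prompt("gpt-4", context="...retrieved chunks...",
                   question="What is our leave policy?"))
```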
Best Practice 6: There is no single best ranking method for the retriever
The quality of the answers provided by the LLM depends on the chunks the retriever returns for the user query. Options typically used in RAG systems include recursive retrieval, hierarchical retrieval, and chunk-combination or re-ranking strategies such as stuff, map re-rank, refine, and map-reduce. Depending on the use case and the option implemented, system accuracy can vary by up to 50%. Another interesting technique is to supplement the data corpus with synthetically generated data: for example, if there are questions the corpus cannot answer, consider indexing synthetic content vetted by human experts alongside the original corpus.
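As one concrete example among the options above, a map re-rank pass answers from each retrieved chunk independently, scores each candidate, and keeps the highest-scoring answer. In the sketch below, score_answer() is a placeholder for an LLM call that returns an answer plus a confidence score; the word-overlap scoring here is only a stand-in so the snippet runs on its own.

```python
# Placeholder: in practice this is an LLM call that answers from a single chunk
# and returns a self-reported confidence score, as in the map re-rank pattern.
def score_answer(question: str, chunk: str) -> tuple[str, float]:
    overlap = len(set(question.lower().split()) & set(chunk.lower().split()))
    return f"Answer drawn from: {chunk[:40]}...", float(overlap)

def map_rerank(question: str, retrieved_chunks: list[str]) -> str:
    scored = [score_answer(question, chunk) for chunk in retrieved_chunks]
    best_answer, _best_score = max(scored, key=lambda pair: pair[1])
    return best_answer

chunks = [
    "Vacation policy: employees accrue 20 days of paid leave per year.",
    "Expense policy: receipts are required for purchases over $25.",
]
print(map_rerank("How many days of paid leave do employees get?", chunks))
```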
Best Practice 7: Provide explainability and hallucination scores
One of the advantages of RAG over a standard LLM is control over the corpus from which answers are generated. Although the answers are still generative in nature, they are grounded in the top chunks returned by the retriever module. This can be used to provide citations for the answers and explainability around how they were generated. The generated answer and the reference chunks can also be passed to a hallucination identification module to check whether the answer is hallucinated. Multiple methods can be used to identify hallucinations, such as SelfCheckGPT, BERTScore, and Natural Language Inference (NLI); in our experiments, NLI works best. The system can also provide a hallucination score based on the answer and the reference chunks. Explainability and hallucination scores are crucial for users to have confidence in the system.
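A hedged sketch of an NLI-based check using Hugging Face transformers is shown below. The model name is one common choice of MNLI model, and the entailment label is looked up from the model config rather than assumed; a low entailment probability for the answer given the retrieved chunks is treated as a hallucination signal.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"  # example NLI model; any MNLI-style model works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def entailment_probability(premise: str, hypothesis: str) -> float:
    """Probability that the premise (retrieved chunks) entails the hypothesis (answer)."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    # Look up the entailment index from the model config instead of hardcoding it.
    entail_idx = [i for i, lbl in model.config.id2label.items()
                  if lbl.lower() == "entailment"][0]
    return float(probs[entail_idx])

reference = "Employees accrue 20 vacation days per year."
answer = "Staff receive 20 days of vacation annually."
score = entailment_probability(reference, answer)
print(f"Entailment (support) score: {score:.2f}")  # low score => possible hallucination
```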
Best Practice 8: Design a comprehensive RAG evaluation method
Creating a ground-truth set of questions with human evaluation is critical to understanding the quality of the system. In addition to standard evaluation, it is important to run the same query multiple times to ensure the RAG system provides consistent answers. While evaluating answer quality in an automated fashion can be challenging, it is possible to design automated evaluation systems that check the consistency of the answers.
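A simple consistency check can be automated by asking the same question several times and measuring how similar the answers are to one another. In this sketch, rag_answer() and embed() are placeholders for the deployed pipeline and its embedding model; the score approaches 1.0 when repeated answers are semantically identical.

```python
from itertools import combinations
import numpy as np

# Placeholders for the deployed RAG pipeline and its embedding model.
def rag_answer(question: str) -> str:
    return "Employees accrue 20 vacation days per year."

def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def consistency_score(question: str, runs: int = 5) -> float:
    """Average pairwise cosine similarity across repeated answers (1.0 = identical)."""
    vectors = [embed(rag_answer(question)) for _ in range(runs)]
    sims = []
    for a, b in combinations(vectors, 2):
        sims.append(float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))))
    return float(np.mean(sims))

print(consistency_score("How many vacation days do employees get?"))
```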
Best Practice 9: Consider Multi-Modal RAG for better results
Most real-world documents contain images, and many answers come from those images. If the document processing step can handle only text, the system will miss questions that can be answered only from images. It is therefore important to verify that the document processing module can process images successfully. One of the best approaches is to extract the images from the documents, summarize them, and create embeddings from the summaries. This makes the images indexable and ensures that the RAG system can answer questions drawn from them.
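The image path described above can be sketched as follows. Here extract_images() and describe_image() are placeholders for a document/image extractor and a vision-capable model, and embed() stands in for the same text embedding model already used for regular chunks.

```python
import numpy as np

# Placeholder: extract images (e.g., from a PDF) and return them as raw bytes.
def extract_images(document_path: str) -> list[bytes]:
    return [b"fake-image-bytes"]

# Placeholder: call a vision-capable model to produce a text summary of the image.
def describe_image(image: bytes) -> str:
    return "Bar chart showing quarterly revenue growth by region."

# Placeholder: the same text embedding model used for regular text chunks.
def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def index_images(document_path: str) -> list[tuple[str, np.ndarray]]:
    """Summarize each image and embed the summary so it is retrievable like a text chunk."""
    entries = []
    for image in extract_images(document_path):
        summary = describe_image(image)
        entries.append((summary, embed(summary)))
    return entries

for summary, vector in index_images("annual_report.pdf"):
    print(summary, vector.shape)
```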
Best Practice 10: Keep updating every module
Due to the rapid pace of research into models and methods in the generative AI space, it is important to stay aware of the latest advancements. Based on the current literature, ensure that the chunking method, embedding model, and vector database are regularly updated to provide best-in-class results.