Hi,

It’s been a while since my last post, mostly because of my own laziness. Over the past year, I’ve been working on several projects, one of which is a small RAG (Retrieval-Augmented Generation) system. I implemented it to combine external knowledge (in this case internal safety documents) with a large language model (LLM). This approach allows the use of data that the LLM wasn’t trained on and also helps reduce hallucinations.

A RAG system consists of several components, such as embeddings, vector storage, ranking, and answer generation. This post gives just a brief overview, but I plan a deep dive into each part. Please have a look at the following papers:

Fig. 1: General overview of a RAG system

Embeddings

The embeddings component converts data into numerical representations or vectors, capturing its semantic meaning. These embeddings enable the system to identify similar data points. Embeddings can be created from multiple data types (text, image, audio, video) by pre-trained models such as BERT, SentenceTransformer, ResNet, or Whisper. These models can also be fine-tuned on your specific data or even trained from scratch.
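As a small illustration, here is a minimal sketch of creating text embeddings with the sentence-transformers library. The model name and the example sentences are only placeholders, not the setup I actually used.

# Minimal sketch: turning text into embeddings with a pre-trained model.
# Assumes the sentence-transformers package; the model name is only an example.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Wear a helmet when entering the construction area.",
    "Fire extinguishers must be inspected every six months.",
]

# encode() returns one dense vector per input sentence.
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)  # e.g. (2, 384) for this model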

The quality of embeddings impacts retrieval accuracy and overall system performance. Additionally, the granularity of the embedded data plays a crucial role: coarse-grained data contains more information but can distract the retriever and the LLM, while fine-grained data can lack essential knowledge. Typically, data is split into chunks using techniques such as fixed-length windows, sliding windows, or others. Semi-structured documents are more challenging, as tables and images require different methods to create embeddings. To enhance or accelerate retrieval, metadata can be stored alongside the chunks.
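For illustration, a rough sliding-window chunker might look like this; splitting on words and the default sizes are simplifying assumptions, real systems often split on tokens, sentences, or document structure instead.

# Rough sketch of fixed-length chunking with a sliding (overlapping) window.
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        # Each chunk overlaps the previous one by `overlap` words.
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks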

Retrieval and Vector Database

The retrieval process identifies the most relevant documents for a user query by calculating similarity scores, often cosine similarity, between the query embedding and the stored embeddings. Vector databases are optimized for handling high-dimensional embeddings at scale through specialized indexing techniques such as approximate nearest neighbor (ANN) search.
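A minimal sketch of this scoring step, assuming the query and chunk embeddings are plain NumPy arrays, could look like this:

import numpy as np

# Sketch: score stored chunk embeddings against a query embedding with
# cosine similarity and return the indices of the n best matches.
def cosine_top_n(query_emb: np.ndarray, doc_embs: np.ndarray, n: int = 5):
    query_norm = query_emb / np.linalg.norm(query_emb)
    doc_norms = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = doc_norms @ query_norm          # cosine similarity per chunk
    top = np.argsort(scores)[::-1][:n]       # best n indices, highest first
    return top, scores[top]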

During retrieval, the vector database returns the top N most similar results. This component prioritizes low latency, often at the expense of accuracy, which is why an additional reranking step is commonly recommended to refine the results. Popular vector databases include Pinecone, Qdrant, Elasticsearch, and FAISS.
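As an example, a sketch of ANN search with FAISS might look like the following; the HNSW index type, the dimensionality, and the random placeholder vectors are just assumptions for the demo.

import faiss
import numpy as np

# Sketch: approximate nearest neighbor search with a FAISS HNSW index.
dim = 384
doc_embs = np.random.rand(1000, dim).astype("float32")   # placeholder chunk embeddings
query = np.random.rand(1, dim).astype("float32")          # placeholder query embedding

index = faiss.IndexHNSWFlat(dim, 32)   # 32 = neighbors per node in the HNSW graph
index.add(doc_embs)                    # store the chunk embeddings

distances, ids = index.search(query, 5)   # top 5 most similar chunks
print(ids[0], distances[0])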

Ranking

The ranking component orders the N candidates retrieved earlier based on their relevance to the user query. This is typically achieved using a model that applies more complex scoring criteria. While reranking improves the quality of results, not all systems require it, as it adds complexity.

Reranking models often need to be fine-tuned for the specific task, but suitable training data may not always be available. In such cases, additional steps may be required to generate training data, such as semi-supervised learning, active learning, or data generation using LLMs. Typical models for reranking are cross-encoders like bge-reranker or the ms-marco models. Alternatively, a large language model can be prompted to rank the candidates.
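A minimal reranking sketch with a cross-encoder from sentence-transformers could look like this; the model name and the example texts are only placeholders.

# Sketch: rerank retrieved candidates with a cross-encoder.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How often must fire extinguishers be inspected?"
candidates = [
    "Fire extinguishers must be inspected every six months.",
    "Wear a helmet when entering the construction area.",
]

# The cross-encoder scores each (query, document) pair jointly.
scores = reranker.predict([(query, doc) for doc in candidates])
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)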

Generation

The LLM generates a response by combining the user query and the selected documents into a single prompt. The quality and relevance of the response depend on the instructions provided in the prompt. For chatbots, conversational history can also be included to maintain context.
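As a sketch, prompt assembly can be as simple as concatenating the retrieved chunks with the query; the exact wording below is only an illustration, not the prompt I use.

# Sketch: build the final prompt from the query and the retrieved chunks.
def build_prompt(query: str, chunks: list[str]) -> str:
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below. "
        "If the context is not sufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )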

Typically, only one retrieval step is performed. However, for complex problems or multi-step reasoning, multiple LLM requests may be required. This process, often referred to as augmentation, involves the LLM acting as a judge or orchestrator. Detailed discussion of augmentation is beyond the scope of this post.

Evaluation

RAG evaluation involves two distinct targets: retrieval and generation. For retrieval, context relevance and noise robustness are the key factors in assessing quality, while for generation, answer faithfulness, answer relevance, negative rejection, information integration, and counterfactual robustness are important.

Retrieval aspects:

  • Context relevance: Ensures that retrieved information is precise and directly related to the query.
  • Noise robustness: Measures the system’s ability to ignore unrelated or irrelevant documents.

Generation aspects:

  • Answer faithfulness: Ensures that the generated answers remain true to the retrieved context.
  • Answer relevance: Assesses whether the generated response directly addresses the user query.
  • Negative rejection: Evaluates the system’s ability to refrain from answering when insufficient reliable information is available.
  • Information integration: Measures the system’s capability to synthesize knowledge from multiple sources.
  • Counterfactual robustness: Ensures the identification and dismissal of incorrect or misleading information.

The primary goal is to ensure relevance and consistency while avoiding contradictions. RAG evaluation is more complex than many other NLP tasks. Frameworks like RAGAS provide tools for evaluation. Alternatively, Gao et al. (2024) assign specific metrics (e.g. F1-score, precision, recall) to each of the listed aspects.
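For the retrieval side, a bare-bones sketch of precision@k and recall@k, assuming you have gold relevant chunk ids per query, could look like this:

# Sketch: simple retrieval metrics given a set of gold relevant chunk ids.
def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int):
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall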

This is a brief overview of the components of a RAG system.

Thank you for your attention.