A TAXONOMY OF RAG
- chandrasekhar.kallipalli
- Oct 29, 2024
- 12 min read
The taxonomy of Retrieval-Augmented Generation (RAG) refers to the categorisation and organisation of the various components, concepts, techniques, and patterns involved in the RAG ecosystem. It helps create a structured framework to understand the different aspects of RAG, including how it functions, its applications, and the evolving innovations in this field.
Imagine a world where AI doesn't just generate responses, but intelligently retrieves and incorporates relevant information from vast knowledge bases. This isn't science fiction—it's the reality of Retrieval-Augmented Generation (RAG), a groundbreaking approach that's revolutionising the landscape of artificial intelligence.
But what exactly is RAG, and why should you care? Whether you're an AI enthusiast, a tech professional, or simply curious about the future of machine learning, understanding RAG is crucial. It's the key to unlocking more accurate, contextual, and trustworthy AI interactions. From enhancing chatbots to powering advanced research tools, RAG is reshaping how we interact with and leverage artificial intelligence.
In this comprehensive guide, we'll dive deep into the fascinating world of RAG. We'll explore its basics, unpack its core components, and examine the cutting-edge methods driving its retrieval and generation processes. Let's embark on this journey to uncover and appreciate the intricate taxonomy of RAG.
Understanding RAG: Retrieval-Augmented Generation
A. Defining RAG
RAG enhances LLMs by integrating real-time retrieval, allowing models to fetch relevant, up-to-date information from external knowledge bases before generating responses. This approach ensures accurate, context-aware outputs, bridging the gap between static models and dynamic content. By some estimates, about 60% of LLM applications today use RAG to combine retrieval and generation for improved performance.
RAG Basics:
Knowledge Cut-off Date:
LLMs are trained on vast amounts of data, but they are not always up to date. For example, GPT-4 Turbo's training data extends only up to April 2023. This is referred to as the knowledge cut-off date: any events or information after this date are not available within the model itself.
Training Data Limitation:
LLMs are typically trained on public data, such as websites, books, and research papers. However, they do not have access to private or internal documents (e.g., company files or customer-specific data). This limits their ability to answer queries related to proprietary or restricted information.
Hallucinations:
LLMs generate text by predicting the next word in a sequence, but they are not designed to verify the accuracy of their statements. This can lead to what’s called hallucinations, where the model confidently provides responses that are factually incorrect or fabricated.
Context Window:
Each LLM has a limited context window, which refers to the maximum number of tokens (roughly, word pieces) the model can process at one time. If the combined input exceeds this limit, tokens beyond the context window are truncated and ignored. A quick way to check is shown below.
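Where the exact limit matters, it helps to count tokens before sending a prompt. Here is a minimal sketch assuming the tiktoken package; the 8,192-token limit and the encoding name are illustrative rather than tied to any particular model.

```python
# Minimal sketch: count tokens before sending a prompt, so nothing silently
# falls outside the context window. Assumes the tiktoken package; the limit
# and encoding name are illustrative.
import tiktoken

MAX_CONTEXT_TOKENS = 8192  # illustrative; check your model's documented limit

enc = tiktoken.get_encoding("cl100k_base")
prompt = "Context: ...retrieved documents...\n\nQuestion: What changed in Q3?"
n_tokens = len(enc.encode(prompt))

if n_tokens > MAX_CONTEXT_TOKENS:
    print(f"Prompt is {n_tokens} tokens; content beyond the window would be lost.")
else:
    print(f"Prompt fits: {n_tokens} / {MAX_CONTEXT_TOKENS} tokens.")
```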
Parametric vs. Non-parametric Memory:
Parametric Memory: This refers to the knowledge stored within the parameters of the LLM during training. LLMs rely on this internal memory to answer questions based on data they have been trained on.
Non-parametric Memory: In RAG, the LLM can access external data sources (like knowledge bases) in real-time. This is referred to as non-parametric memory, where the LLM augments its responses with information retrieved from external databases.
Knowledge Base:
The external data source (such as databases or documents) that the RAG system retrieves information from is known as the knowledge base. This provides the LLM with access to up-to-date or proprietary information to enhance its responses.
User Query:
This is the prompt or question that the user sends to the system, which triggers the retrieval process.
| Component | Function |
| --- | --- |
| Retrieval | The retrieval process fetches relevant information from the knowledge base in response to the user’s query. The goal is to find the most pertinent data to augment the LLM’s response. |
| Augmentation | After retrieving relevant documents, the system augments the query by combining the user’s prompt with the retrieved data. This enriched query is then fed to the LLM for generating the response. |
| Generation | Finally, the LLM generates the output based on the augmented query, producing a more accurate and contextually relevant answer. |
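To make the three rows above concrete, here is a minimal sketch of the retrieve-augment-generate flow. The `retrieve` and `llm_generate` functions are hypothetical stand-ins: in a real system the first would query a vector store and the second would call an LLM provider's API.

```python
# Minimal sketch of the retrieve -> augment -> generate loop.
# `retrieve` and `llm_generate` are hypothetical stand-ins, not a real API.

def retrieve(query: str, k: int = 3) -> list[str]:
    # Hypothetical retriever: in practice this would query a vector database.
    knowledge_base = {
        "returns": "Items may be returned within 30 days of purchase.",
        "shipping": "Standard shipping takes 3-5 business days.",
    }
    return [text for key, text in knowledge_base.items() if key in query.lower()][:k]

def llm_generate(prompt: str) -> str:
    # Hypothetical LLM call: replace with your model provider's client.
    return f"[answer generated from a {len(prompt)}-character prompt]"

def rag_answer(query: str) -> str:
    chunks = retrieve(query)                                   # Retrieval
    prompt = "Context:\n" + "\n".join(chunks) + f"\n\nQuestion: {query}\nAnswer:"  # Augmentation
    return llm_generate(prompt)                                # Generation

print(rag_answer("What is your returns policy?"))
```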
B. Core components of RAG
The core components of RAG define the technical infrastructure that enables retrieval and generation to work together seamlessly.
Document Retriever: The document retriever is responsible for fetching relevant information from the knowledge base. It uses various techniques to identify and retrieve the most pertinent documents or passages based on the input query.
Knowledge Base: The knowledge base is the foundation of any RAG system, containing the external information that augments the generation process. It can be structured in various ways, such as:
| Structure Type | Description | Example |
| --- | --- | --- |
| Vector Databases | Specialized for storing and querying high-dimensional vectors (embeddings) | Pinecone, Weaviate, Milvus, Qdrant |
| Document Databases | Store semi-structured data as documents (often JSON-like) | MongoDB, Elasticsearch, Couchbase |
| Graph Databases | Store data in nodes and edges, representing relationships | Neo4j, Amazon Neptune, ArangoDB |
| Relational Databases | Organize data into tables with predefined schemas | PostgreSQL, MySQL, SQLite |
Indexing Pipeline: This involves creating and updating the knowledge base used for retrieval. Data is loaded, processed, and stored for quick access during the retrieval stage.
Chunking: Long documents are split into smaller, more manageable sections called “chunks” to improve searchability; a minimal chunking sketch follows below.
Metadata: Metadata (like timestamps and authorship) is attached to documents to make retrieval more accurate and efficient.
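Here is a minimal sketch of fixed-size chunking with overlap and attached metadata; the chunk size, overlap, and metadata fields are illustrative choices, and real pipelines often split along sentence or section boundaries instead.

```python
# Minimal sketch: split a document into overlapping fixed-size chunks and
# attach metadata so each chunk can be traced back to its source.

def chunk_document(text: str, source: str, chunk_size: int = 500, overlap: int = 50) -> list[dict]:
    step = chunk_size - overlap
    chunks = []
    for i, start in enumerate(range(0, len(text), step)):
        chunks.append({
            "text": text[start:start + chunk_size],
            "metadata": {"source": source, "chunk_id": i},  # could also hold timestamps, authorship
        })
    return chunks

chunks = chunk_document("A long policy document. " * 200, source="policy.pdf")
print(len(chunks), chunks[0]["metadata"])
```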
Retrieval Techniques:
A. Dense Vector Retrieval
Dense vector retrieval is a powerful method that represents documents and queries as high-dimensional vectors in a continuous semantic space. This approach offers several advantages:
Captures semantic meaning (cosine similarity)
Handles synonyms and related concepts well
Efficient for large-scale retrieval
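The sketch below illustrates dense retrieval with cosine similarity, assuming the sentence-transformers package; the model name is just one commonly used embedding model.

```python
# Minimal sketch of dense retrieval: embed documents and query, then rank
# by cosine similarity. Assumes the sentence-transformers package.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # one common embedding model

docs = [
    "Refunds are issued within 30 days.",
    "Our office is open Monday to Friday.",
    "Shipping usually takes 3-5 business days.",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)        # unit-length vectors
query_vec = model.encode(["How long does delivery take?"],
                         normalize_embeddings=True)[0]

scores = doc_vecs @ query_vec   # dot product of unit vectors equals cosine similarity
best = int(np.argmax(scores))
print(docs[best], float(scores[best]))
```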
B. Sparse Vector Retrieval
Sparse vector retrieval, often based on traditional information retrieval techniques like TF-IDF or BM25, represents documents as sparse vectors of term frequencies.
Key characteristics:
Relies on exact term matches
Efficient for keyword-based searches
Implementation:
Create an inverted index of terms
Compute term frequencies and document frequencies
Calculate relevance scores using TF-IDF or BM25 algorithms
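As a concrete example of the steps above, here is a minimal BM25 sketch assuming the rank_bm25 package, which builds the term statistics internally; production systems typically rely on a search engine such as Elasticsearch instead.

```python
# Minimal sketch of sparse retrieval with BM25. Assumes the rank_bm25 package;
# the whitespace tokenization is deliberately simplistic.
from rank_bm25 import BM25Okapi

docs = [
    "Refunds are issued within 30 days.",
    "Our office is open Monday to Friday.",
    "Shipping usually takes 3-5 business days.",
]
tokenized_docs = [d.lower().split() for d in docs]
bm25 = BM25Okapi(tokenized_docs)

query_tokens = "shipping time".split()
scores = bm25.get_scores(query_tokens)          # one relevance score per document
ranked = sorted(zip(scores, docs), reverse=True)
print(ranked[0])
```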
C. Hybrid Retrieval Approaches
Hybrid approaches combine the strengths of dense and sparse vector retrieval methods to achieve better performance:
Leverage both semantic understanding and exact matching
Adaptable to different types of queries and documents
Often outperform single-method approaches
Examples of hybrid techniques:
ColBERT: Combines BERT-based dense representations with late interaction
SPLADE: Uses sparse lexical representations with learned weights
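ColBERT and SPLADE are learned models, but a much simpler way to combine dense and sparse results is rank-level fusion. The sketch below uses reciprocal rank fusion (RRF) over two illustrative rankings; the constant k=60 is the value commonly used in the RRF literature, not something prescribed here.

```python
# Minimal sketch of hybrid retrieval via reciprocal rank fusion (RRF):
# merge a dense ranking and a sparse ranking into one fused ranking.

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

dense_ranking = ["doc3", "doc1", "doc2"]    # e.g. from embedding similarity
sparse_ranking = ["doc1", "doc3", "doc4"]   # e.g. from BM25
print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))
```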
Evaluation Metrics for RAG:
Evaluation is critical for assessing how well RAG systems perform in real-world scenarios. There are several key metrics that help evaluate both the retrieval and generation phases:
Precision: How many of the retrieved documents are actually relevant?
Recall: Of all the relevant documents available, how many were retrieved?
F1-score: A balance between precision and recall, giving an overall measure of retrieval performance.
Answer Faithfulness: Ensures that generated responses match the factual content in the retrieved documents, reducing hallucinations.
Latency: The speed at which the system retrieves information and generates responses is critical for real-time applications.
Hallucination Rate: This metric measures how often the model generates false or misleading information.
These metrics help determine the quality and reliability of the RAG system, ensuring it provides relevant and trustworthy responses. A small worked example of the retrieval metrics is shown below.
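```python
# Minimal sketch: retrieval precision, recall, and F1 for one query,
# computed from illustrative document IDs.
retrieved = {"doc1", "doc2", "doc5"}            # what the retriever returned
relevant = {"doc1", "doc3", "doc5", "doc7"}     # ground-truth relevant documents

true_positives = len(retrieved & relevant)
precision = true_positives / len(retrieved)     # 2/3: share of retrieved docs that are relevant
recall = true_positives / len(relevant)         # 2/4: share of relevant docs that were retrieved
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```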
Pipeline Design in RAG
A well-structured pipeline is critical for efficient RAG systems. The RAG pipeline consists of several stages that allow for smooth retrieval and generation of content. The main approaches to pipeline design include:
[Figure: comparison of Naive, Advanced, and Modular RAG pipelines. Image source: https://arxiv.org/pdf/2312.10997]
1. Naive RAG:
This is a basic linear pipeline, where the process flows from retrieval to reading and then to generation. The retrieval system fetches documents or data relevant to the query, and the LLM generates the response based on this data.
2. Advanced RAG:
Advanced RAG pipelines introduce several stages to optimize the process and improve accuracy. This includes pre-retrieval interventions like query rewriting and post-retrieval stages like reranking. The result is a Rewrite-Retrieve-Rerank-Read model that provides more refined and accurate responses.
Pipeline Components:
Multi-query expansion: Multiple variations of the original query are generated using an LLM, and each variant is used to retrieve chunks from the knowledge base.
Sub-query expansion: Instead of generating query variations, a complex query is broken down into simpler sub-queries.
Step-back expansion: This approach abstracts the original query into a higher-level conceptual query for better retrieval.
Query transformation: The original user query is transformed into one more suitable for retrieval.
Query rewriting: The input query is rewritten for better retrieval accuracy. This is often necessary when the input may not be directly suitable for retrieval tasks.
HyDE (Hypothetical Document Embeddings): HyDE is a method where the LLM first generates a hypothetical response or document based on the query. This hypothetical document is then embedded and used to retrieve real documents from the knowledge base, the idea being that the generated hypothesis can guide the retrieval system towards more relevant information (a minimal sketch follows this list).
Query Routing: Query Routing involves directing a query to the appropriate knowledge base, model, or data source based on the type of question or the domain. Instead of using a single retriever or knowledge source, the system routes queries dynamically to the most relevant sources.
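Below is a minimal HyDE sketch. The `llm_generate` function is a hypothetical stand-in for a real LLM call, and the embedding model name is one common choice; the key point is that the hypothetical answer, not the raw query, is what gets embedded and searched.

```python
# Minimal sketch of HyDE: generate a hypothetical answer, embed it, and use
# that embedding for nearest-neighbour search over the knowledge base.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def llm_generate(prompt: str) -> str:
    # Hypothetical LLM call; replace with your model provider's client.
    return "Standard shipping normally arrives within three to five business days."

def hyde_query_vector(query: str):
    hypothetical_doc = llm_generate(f"Write a short passage that answers: {query}")
    return embedder.encode([hypothetical_doc], normalize_embeddings=True)[0]

vec = hyde_query_vector("How long does delivery take?")
print(vec.shape)   # this vector is then matched against document embeddings
```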
Hybrid Retrieval:
In hybrid retrieval, the strategy combines different methods like keyword-based search with semantic similarity searches. It can also integrate sparse embeddings, dense embeddings, and knowledge graph-based searches for improved accuracy.
Iterative and Recursive Retrieval:
Iterative Retrieval: The system retrieves information iteratively, refining the retrieved documents after each round of generation.
Recursive Retrieval: It builds upon iterative retrieval by transforming the retrieval query after each generation to improve context.
Adaptive Retrieval:
This method introduces intelligence into retrieval, where the LLM determines the most appropriate moment and the most relevant content for retrieval, dynamically adjusting based on the interaction.
Contextual Compression:
This technique reduces the length of the retrieved information by extracting only the parts that are most relevant to the query. This reduces costs and improves system efficiency.
Reranking:
Reranking is used to refine the retrieved information from different sources and retrieval methods. Using rerankers such as multi-vector, Learning to Rank (LTR), and BERT-based techniques, the system improves the relevance of documents.
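The sketch below shows cross-encoder reranking, assuming the sentence-transformers package; the model name is one commonly used reranking checkpoint, and multi-vector or LTR rerankers would slot into the same position in the pipeline.

```python
# Minimal sketch of reranking: score each (query, candidate) pair with a
# cross-encoder and reorder the candidates. Assumes sentence-transformers.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How long does delivery take?"
candidates = [
    "Refunds are issued within 30 days.",
    "Shipping usually takes 3-5 business days.",
    "Our office is open Monday to Friday.",
]
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])   # the most relevant candidate after reranking
```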
3. Modular RAG:
A modular approach breaks down the traditional RAG structure into interchangeable components, which allows for customisation based on specific tasks. Modules include retrievers, indexing, generation, as well as additional components like search and memory.
RAG Fusion: RAG Fusion improves upon traditional search systems by addressing limitations through a multi-query approach. It merges results from different queries or sources to create a more comprehensive response.
Routing and Task Adaptation:
Routing: This navigates through diverse data sources, selecting the optimal pathway based on the query type, domain, or other criteria.
Task Adapter: This module adapts RAG for specific downstream tasks like summarization, translation, or sentiment analysis, allowing for fine-tuning based on minimal examples.
These components work together to create efficient and scalable RAG pipelines, ensuring that the generated responses are accurate, context-aware, and relevant.
RAG Generation Techniques
A. Prompt-based generation
Prompt-based generation is a popular technique in RAG systems that leverages the power of large language models. This method involves crafting specific prompts that guide the model to generate appropriate responses.
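As an illustration, here is a minimal RAG prompt template; the wording and layout are illustrative rather than a prescribed format.

```python
# Minimal sketch of prompt-based generation: a template that grounds the
# model's answer in the retrieved context.
PROMPT_TEMPLATE = """You are a helpful assistant. Answer the question using ONLY
the context below. If the answer is not in the context, say you don't know.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question: str, chunks: list[str]) -> str:
    return PROMPT_TEMPLATE.format(context="\n\n".join(chunks), question=question)

print(build_prompt("What is the returns window?",
                   ["Items may be returned within 30 days of purchase."]))
```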
B. Fine-tuning-based generation
Fine-tuning involves further training a pre-trained language model on task-specific data to improve its performance for RAG applications.
C. Few-shot learning in RAG
Few-shot learning enables RAG systems to generate responses with limited examples, making it particularly useful for scenarios with scarce training data.
In-context learning: Providing examples within the prompt
Meta-learning: Training the model to adapt quickly to new tasks
Transfer learning: Leveraging knowledge from related tasks
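A minimal sketch of in-context (few-shot) learning is shown below: a couple of worked question-answer pairs are placed in the prompt before the real question. The examples and format are illustrative.

```python
# Minimal sketch of few-shot / in-context learning: prepend worked examples
# to the prompt so the model imitates the answer format.
FEW_SHOT_EXAMPLES = [
    ("Context: Returns are accepted for 30 days.\nQ: What is the returns window?",
     "A: 30 days."),
    ("Context: Support is open 9am-5pm on weekdays.\nQ: When can I call support?",
     "A: Weekdays, 9am-5pm."),
]

def few_shot_prompt(context: str, question: str) -> str:
    shots = "\n\n".join(f"{q}\n{a}" for q, a in FEW_SHOT_EXAMPLES)
    return f"{shots}\n\nContext: {context}\nQ: {question}\nA:"

print(few_shot_prompt("Shipping takes 3-5 business days.", "How long is shipping?"))
```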
What is the Operations Stack?
The Operations Stack refers to the collection of layers and components that manage the functioning, optimisation, and security of RAG systems. These layers support data storage, model deployment, retrieval, generation, monitoring, and more, ensuring that the system operates reliably and at scale.
Core Layers of the Operations Stack
Data Layer
Role: The Data Layer is the backbone of the RAG system, responsible for creating and storing the knowledge base. It collects data from various source systems, transforms it into a usable format, and ensures it's ready for fast retrieval.
Importance: Without a well-structured Data Layer, retrieval systems cannot efficiently access the required information. This layer must be optimized to handle large amounts of data while ensuring low latency during retrieval operations.
Model Layer
Role: The Model Layer handles the deployment and management of the generative AI models (LLMs). This layer includes pre-trained models, custom fine-tuning, and optimization of inference operations.
Managed vs. Self-hosted Deployment:
Fully managed services (like AWS, Azure) take care of infrastructure and scaling.
Self-hosted deployment options (using Kubernetes, Docker) allow for more control but require significant management.
Edge Deployment runs models on local hardware or edge devices for privacy, reduced latency, and offline functionality.
Application Orchestration Layer
Role: This layer is responsible for managing the interactions between various components such as data sources, retrieval systems, generation models, and user interfaces.
Importance: It's the central coordinator, ensuring all processes work in harmony to deliver accurate and timely results.
Performance and Monitoring Layers
Monitoring Layer
Role: Continuous monitoring of the system is essential for tracking resource utilisation, detecting failure points, and measuring performance metrics like latency and error rates.
Security and Privacy Layer
Role: Ensuring the security and privacy of sensitive data is paramount. RAG systems must comply with data privacy regulations and apply techniques such as encryption, anonymization, and differential privacy to protect information in vector databases.
Security Features: Guardrails, access control, and continuous auditing are used to protect against data breaches or unauthorized access. Query validation and sanitization help ensure safe operations.
Caching Layer
Role: Caching is vital to reducing latency and minimizing costs in RAG systems. Given the high computational demands of both retrieval and generation, caching frequently requested data can significantly improve response times and cost efficiency.
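A minimal query-level cache might look like the sketch below; the in-memory dictionary is a stand-in for a real cache such as Redis, and hashing the normalized query is one simple keying strategy (semantic caches key on query embeddings instead).

```python
# Minimal sketch of a query-level cache: identical (normalized) queries skip
# the expensive retrieve-and-generate path. The dict stands in for a real cache.
import hashlib

_cache: dict[str, str] = {}

def cached_answer(query: str, answer_fn) -> str:
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key not in _cache:
        _cache[key] = answer_fn(query)   # full RAG pipeline would run here
    return _cache[key]

print(cached_answer("What is the returns policy?",
                    lambda q: f"[answer generated for: {q}]"))
print(cached_answer("what is the returns policy?",   # served from cache
                    lambda q: "[never called]"))
```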
Optimization and Efficiency Layers
Enhancement Layer
Role: This layer focuses on improving system efficiency, scalability, and usability. It's designed to enhance the overall performance of RAG systems by implementing features tailored to the specific requirements of the task at hand.
Cost Optimization Layer
Role: RAG systems, especially large-scale ones, are resource intensive. The Cost Optimization Layer manages resource allocation efficiently, reducing computational overheads and ensuring that the system operates cost-effectively.
Human Oversight and Transparency
Human-in-the-loop Layer
Role: Some tasks demand a higher degree of accuracy or ethical considerations, such as legal or medical queries. In these cases, human oversight is necessary to ensure that responses generated by the RAG system are appropriate and accurate.
Explainability and Interpretability Layer
Role: AI systems, especially in critical applications, need to be transparent and accountable. This layer ensures that the RAG system’s decisions are interpretable, providing transparency into why specific documents or data points were retrieved and how conclusions were drawn.
Importance: In domains like healthcare or finance, where accountability is crucial, explainability is essential for ensuring user trust.
Collaboration and Experimentation Layer
Role: This layer supports teams working on the development and experimentation of RAG systems. While it’s not always critical for day-to-day operations, it allows for the continuous improvement and testing of new features or models.
Importance: It fosters innovation and improvement by providing a structured environment for development, ensuring that new iterations of the system are properly tested before full-scale deployment.
Emerging Patterns in Retrieval-Augmented Generation (RAG)
1. Knowledge Graph-Powered RAG
Knowledge graphs organize data into structured entities and relationships, enhancing a system's ability to understand and reason with context. This structure not only improves the retrieval process but also equips RAG systems with improved explainability and reasoning capabilities.
Key Concepts:
GraphRAG: Developed by Microsoft, this framework automatically creates knowledge graphs from source documents. The system then leverages these graphs during retrieval to ensure more precise and semantically accurate responses.
Graph Communities: Partitioning entities and relationships into clusters or communities, allowing for more efficient and focused retrieval.
Community Summaries: LLM-generated summaries for each graph community provide insights into the topical structure and semantics of the data.
2. Multimodal RAG
While most traditional RAG systems focus on text-based retrieval and generation, Multimodal RAG extends this capability to handle data in various formats, such as images, videos, and audio, alongside text. This cross-modal capability vastly expands the range of applications for RAG systems.
Key Techniques:
Multimodal Embeddings: Unified vector representations that encode multiple data types, allowing for retrieval across different modalities.
Contrastive Learning: Used to align data across various modalities by ensuring that semantically similar items (such as an image and its description) are brought closer together in the shared embedding space.
Applications:
Systems like CLIP (Contrastive Language-Image Pre-training) by OpenAI leverage contrastive learning to retrieve and generate content across text and image modalities.
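For illustration, the sketch below scores an image against candidate captions with CLIP, assuming the transformers and Pillow packages and a local image file (the file name is hypothetical); the same shared embedding space is what makes cross-modal retrieval possible.

```python
# Minimal sketch of cross-modal matching with CLIP: score one image against
# candidate text descriptions in a shared embedding space.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("chart.png")   # hypothetical local image file
captions = ["a bar chart of quarterly revenue", "a photo of a cat"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)   # similarity of the image to each caption
print(probs)
```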
3. Agentic RAG
In Agentic RAG, LLM-based agents are employed to adapt workflows based on the query type and document complexity. This dynamic adjustment enhances the accuracy and relevance of RAG outputs in complex retrieval tasks.
Key Concepts:
Routing Agents: These agents are responsible for directing queries to the most appropriate knowledge sources based on the query's intent or context.
Query Planning Agents: For complex queries, these agents break them down into sub-queries and manage their execution across various retrieval pipelines.
Adaptive Frameworks: Dynamically adjust the retrieval and generation strategies to provide relevant responses based on the evolving context and data.
Technology providers:
| Category | Technology Providers |
| --- | --- |
| Model Access, Training & Fine-Tuning | OpenAI, HuggingFace, Google Vertex AI, Anthropic, AWS Bedrock, AWS Sagemaker, Cohere, Azure Machine Learning, IBM Watson AI, Mistral AI, Salesforce Einstein, Databricks Dolly, NVIDIA NeMo, EleutherAI |
| Vector DB and Indexing | Pinecone, Milvus, Chroma, Weaviate, Deep Lake, Qdrant, Elasticsearch, Vespa, Redis (Vector Search Support), Vald, Zilliz, Marqo, PGVector, MongoDB (with vector capabilities), SingleStore |
| Data Loading | Snorkel AI, LlamaIndex, LangChain, Scale AI, Labelbox, Superb AI, Explorium, Roboflow, Datature, V7 Labs, Clarifai |
| Application Framework | LangChain, LlamaIndex, Haystack, CrewAI (Agentic Orchestration), AutoGen (Agentic Orchestration), LangGraph (Agentic Orchestration), Rasa (Conversational AI), Flyte, Prefect, Airflow, Metaflow |
| Prompt Engineering | W&B (Weights & Biases), PromptLayer, TruLens, TruEra, PromptHero, TextSynth |
| Deployment Frameworks | vLLM, TensorRT-LLM, ONNX Runtime, KubeFlow, MLflow, Ray Serve, Triton Inference Server, Seldon Deploy |
| Deployment & Inferencing | AWS, GCP, OpenAI API, Azure, IBM Cloud, Oracle Cloud Infrastructure, Heroku, Kubernetes, DigitalOcean, Vercel |
| Monitoring | HoneyHive, TruEra, Fiddler AI, Arize AI, Aporia, WhyLabs, Evidently AI, Superwise, Monte Carlo, Datadog |
| Proprietary LLMs/VLMs | GPT series by OpenAI, Gemini series by Google, Claude series by Anthropic, Command series by Cohere, Jurassic by AI21 Labs, PaLM by Google, LaMDA by Google |
| Open Source LLMs | Llama series by Meta, Mixtral by Mistral, Falcon by TII, Vicuna by LMSYS, GPT-NeoX by EleutherAI, Pythia by EleutherAI, Dolly 2.0 by Databricks, Phi by Microsoft |
| Small Language Models | Phi by Microsoft, GPT-Neo by EleutherAI, DistilBERT by HuggingFace, TinyBERT, ALBERT (A Lite BERT) by Google, MiniLM by Microsoft, DistilGPT2, Reformer by Google, T5-Base |
| Managed RAG Solutions | OpenAI File Search, Amazon Bedrock Knowledge Bases, Azure AI File Search, Claude Projects, Vectorize.io |
| Knowledge Graph and Ontology | Neo4j, Stardog, TerminusDB, TigerGraph |
| Security and Privacy | Hazy, Duality, BigID |
| Synthetic Data | Mostly AI, Tonic.ai, Synthesis AI |
| Others | Cohere reranker, Unstructured.io |