RAG
Retrieval-Augmented Generation: Enhancing LLMs with External Knowledge
Context windows in LLMs represent the amount of text they can process in a single interaction:
- 1M tokens ≈ 750 pages of text (current largest: Qwen2.5-Turbo)
- 200K tokens ≈ 150 pages of text (previous standard)
While larger context windows might seem to reduce the need for RAG, both serve different purposes:
| Large Context Windows | RAG |
|---|---|
| Limited to window size | Can access unlimited external data |
| All data must be sent in prompt | Selective retrieval of relevant data |
| Higher API costs for large contexts | More cost-effective for large datasets |
| Better for single-document analysis | Better for vast knowledge bases |
RAG remains valuable even with larger contexts because it enables:
- Efficient handling of massive datasets
- Dynamic data updates
- Selective information retrieval
- Cost optimization
Overview of RAG (Retrieval-Augmented Generation)
Retrieve (relevant data) -> Augment (query w/ context) -> Generate (response)
| What is RAG? | How it Works | Benefits |
|---|---|---|
| System combining LLMs with data retrieval | Searches through your data sources | Access to unlimited external data |
| Framework for intelligent data access | Finds relevant information | Real-time data updates |
| Knowledge enhancement tool | Combines context with queries | Cost-effective scaling |
| Private data integration system | Generates accurate responses | Domain-specific expertise |
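To make the Retrieve -> Augment -> Generate flow concrete, here is a minimal, framework-agnostic sketch. The keyword-overlap retriever and the `call_llm` stub are illustrative placeholders, not a real retriever or LLM client.

```python
# Toy corpus standing in for an external knowledge base.
documents = [
    "The capital of France is Paris.",
    "Qwen2.5-Turbo supports a context window of about 1M tokens.",
]

def retrieve(query: str, docs: list[str], top_k: int = 1) -> list[str]:
    """Toy retriever: rank documents by the number of words shared with the query."""
    def score(doc: str) -> int:
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(docs, key=score, reverse=True)[:top_k]

def augment(query: str, context: list[str]) -> str:
    """Prepend the retrieved context to the user's question."""
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (OpenAI, a local model, etc.)."""
    return f"[LLM response to a {len(prompt)}-character prompt]"

query = "What is the capital of France?"
answer = call_llm(augment(query, retrieve(query, documents)))
print(answer)
```

In a real application the retriever is replaced by vector search (see below) and `call_llm` by an actual model client, but the three stages stay the same.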
Core Concept
The idea of RAG is to build an app on top of an LLM, creating an Intelligent Agent capable of selecting between multiple data sources and using actions as tools to retrieve information.
Development Pattern
Development usually follows this pattern:
- One-off queries (e.g., notebook or CLI)
- Ongoing chat
- Real-time responses with streaming
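As a sketch of the streaming pattern, LlamaIndex (for example) supports streaming query engines. The `./data` directory is a placeholder, and the default embedding model and LLM assume an OpenAI API key in the environment.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Build a small index over local files.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Streaming query engine: response tokens are printed as they are generated.
query_engine = index.as_query_engine(streaming=True)
streaming_response = query_engine.query("Summarize the key points of these documents.")
streaming_response.print_response_stream()
```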
Agentic RAG
Agentic RAG enables complex, multi-step reasoning and decision-making over data, unlike standard RAG, which handles simple, single-step queries.
Basics: there are Documents and a Router Engine
Document
- Stored in a vector store index (for Q&A over the most similar nodes)
- Also indexed in a summary index over its text (for summarization over all nodes)
Router Engine
- Routes to a Q&A query engine (vector index; retrieval behaviour A => fetches the most similar nodes)
- Routes to a summarization query engine (summary index; retrieval behaviour B => fetches all nodes)
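A minimal sketch of this setup using LlamaIndex's router pattern. The file path is a placeholder, and the default embedding model and LLM assume an OpenAI API key.

```python
from llama_index.core import SimpleDirectoryReader, SummaryIndex, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

# Load one document and split it into nodes ("./paper.pdf" is a placeholder).
documents = SimpleDirectoryReader(input_files=["./paper.pdf"]).load_data()
nodes = SentenceSplitter(chunk_size=1024).get_nodes_from_documents(documents)

# Two indexes over the same nodes: vector (similar nodes) and summary (all nodes).
vector_index = VectorStoreIndex(nodes)
summary_index = SummaryIndex(nodes)

vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_index.as_query_engine(),
    description="Useful for answering specific questions about the document.",
)
summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_index.as_query_engine(response_mode="tree_summarize"),
    description="Useful for summarizing the entire document.",
)

# The router's LLM selector picks the Q&A tool or the summarization tool per query.
router_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[vector_tool, summary_tool],
)
print(router_engine.query("Give me a high-level summary of the document."))
```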
#### Methods
- Routing: Decision-making to select appropriate tools
- Tool Use: Interface for agents to select and use tools with proper arguments
- Multi-step Reasoning: Using LLMs with various tools while maintaining memory
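A hedged sketch of tool use and multi-step reasoning with a LlamaIndex ReAct agent. The `word_count` function, the `./data` directory, and the tool names/descriptions are made-up examples, and the default models assume an OpenAI API key.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool, QueryEngineTool

def word_count(text: str) -> int:
    """Count the words in a piece of text."""
    return len(text.split())

# A plain Python function exposed as a tool.
count_tool = FunctionTool.from_defaults(fn=word_count)

# A RAG query engine over local documents exposed as a second tool.
index = VectorStoreIndex.from_documents(SimpleDirectoryReader("./data").load_data())
rag_tool = QueryEngineTool.from_defaults(
    query_engine=index.as_query_engine(),
    name="docs",
    description="Answers questions about the local document collection.",
)

# ReAct loop: reason -> choose a tool -> observe the result -> continue, with chat memory.
agent = ReActAgent.from_tools([count_tool, rag_tool], verbose=True)
print(agent.chat("What do the docs say about RAG? Also, how many words are in your answer?"))
```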
#### Use-cases
- Handles complex research queries across multiple documents
- Allows user intervention and guidance
- Provides debugging capabilities
- Enables detailed control and oversight
Commercial Applications of Agentic RAG
- Research & Analysis
  - Market research automation
  - Academic literature review
  - Competitive intelligence gathering
  - Patent analysis
- Legal & Compliance
  - Contract analysis
  - Legal document review
  - Regulatory compliance checking
  - Case law research
- Healthcare
  - Medical literature synthesis
  - Patient record analysis
  - Clinical trial data review
  - Treatment protocol research
- Financial Services
  - Investment research
  - Risk analysis reports
  - Financial document processing
  - Market trend analysis
- Customer Service
  - Complex query resolution
  - Technical support assistance
  - Product documentation analysis
  - Knowledge base management
Key Differences between LangChain and LlamaIndex for RAG with tools
LlamaIndex
- Specialized in data ingestion and indexing
- Better document-centric operations
- Simpler data connectors and indexing primitives
- More focused on RAG-specific optimizations
- Lighter weight, easier learning curve
LangChain
- Broader framework for general LLM applications
- More extensive tool/agent ecosystem
- Greater flexibility for complex chains
- Stronger focus on agent orchestration
- Larger community and more integrations
Choice Guidelines
a) Use LlamaIndex if primarily focused on document processing/RAG
b) Use LangChain if building complex multi-tool applications
c) Can use both together (LlamaIndex for indexing, LangChain for orchestration)
Vector Search: The Core Technology
Vector search is the solution to context window limitations:
- Convert text to numbers:
  - Data is encoded as meaning rather than keywords
  - The numbers are called vectors (embeddings of the text)
  - The vector space contains all your vectors (meanings)
  - The user prompt is also encoded into a vector (meaning)
- Process:
  a) Embed the data, embed the query
  b) Retrieve context from the vector space (similarity = nearby vectors)
  c) Feed the context + query to the LLM for a response
Note: All interactions with the LLM are in plain English text. RAG libraries maintain mappings between embeddings and original text.
Illustration
1. Data embedding and storage:
   - Text: "The capital of France is Paris."
   - Embedding: [0.23, -0.11, 0.45, ...]
   - Stored mapping: {[0.23, -0.11, 0.45, ...]: "The capital of France is Paris."}
2. Query embedding:
   - Query: "What is the capital of France?"
   - Embedding: [0.21, -0.10, 0.44, ...]
3. Retrieve context:
   - Search using the query embedding [0.21, -0.10, 0.44, ...]
   - Retrieve the closest stored embedding [0.23, -0.11, 0.45, ...]
4. Retrieve original text:
   - Look up the embedding [0.23, -0.11, 0.45, ...] in the stored mapping
   - Retrieve the text: "The capital of France is Paris."
5. Combine and query:
   - Context + query: "The capital of France is Paris. What is the capital of France?"
   - Send this combined text to the LLM
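The same walkthrough as a runnable toy sketch. The embedding values are invented for illustration; a real system would obtain them from an embedding model and keep them in a vector database rather than a Python dict.

```python
import numpy as np

# Invented 3-dimensional embeddings standing in for real embedding-model output.
store = {
    "The capital of France is Paris.": np.array([0.23, -0.11, 0.45]),
    "The capital of Japan is Tokyo.": np.array([0.05, 0.38, -0.20]),
}
query_text = "What is the capital of France?"
query_vec = np.array([0.21, -0.10, 0.44])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: values near 1.0 mean the vectors point the same way (similar meaning)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Retrieve the stored text whose embedding is closest to the query embedding.
best_text = max(store, key=lambda text: cosine(store[text], query_vec))

# Combine context and query; this augmented prompt is what gets sent to the LLM.
prompt = f"{best_text} {query_text}"
print(prompt)  # -> "The capital of France is Paris. What is the capital of France?"
```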
Production Readiness
To make the app production-ready:
- Persist data and load it again
- Implement near-live re-indexing
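A minimal persistence sketch with LlamaIndex; the directory names are placeholders.

```python
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

PERSIST_DIR = "./storage"  # placeholder directory

# First run: build the index from raw documents and persist it to disk.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist(persist_dir=PERSIST_DIR)

# Subsequent runs: reload the persisted index instead of re-embedding everything.
storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
index = load_index_from_storage(storage_context)
```

For near-live re-indexing, recent LlamaIndex versions also expose `index.refresh_ref_docs(documents)`, which re-embeds only the documents whose content has changed.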
Building RAG Applications
Procedure:
- Load data + create index
- Create a Query Engine:
  - Retriever: fetch relevant context
  - Postprocessing: process retrieved content
  - Synthesizer: combine context with query for the LLM
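A sketch of this procedure with LlamaIndex, wiring the retriever, postprocessor, and synthesizer explicitly. The `./data` path, `similarity_top_k`, and the similarity cutoff are illustrative values, and the default embedding model and LLM assume an OpenAI API key.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, get_response_synthesizer
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever

# 1) Load data + create index.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# 2) Create a query engine from its three parts.
retriever = VectorIndexRetriever(index=index, similarity_top_k=5)   # fetch relevant context
postprocessor = SimilarityPostprocessor(similarity_cutoff=0.7)      # drop weak matches
synthesizer = get_response_synthesizer()                            # combine context + query for the LLM

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=synthesizer,
    node_postprocessors=[postprocessor],
)
print(query_engine.query("What does this collection say about vector search?"))
```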
Requirements:
- RAG framework (LangChain, LlamaIndex, etc.)
- Model components:
  - LLM options:
    - OpenAI GPT-4 (128K context)
    - Qwen2.5-Turbo (1M context) - Alibaba's model
  - Embedding model for vector search