RAG
Retrieval-Augmented Generation: Enhancing LLMs with External Knowledge
Context windows in LLMs represent the amount of text they can process in a single interaction:
- 1M tokens ≈ 750 pages of text (current largest: Qwen2.5-Turbo)
- 200K tokens ≈ 150 pages of text (previous standard)
While larger context windows might seem to reduce the need for RAG, both serve different purposes:
| Large Context Windows | RAG |
|---|---|
| Limited to window size | Can access unlimited external data |
| All data must be sent in prompt | Selective retrieval of relevant data |
| Higher API costs for large contexts | More cost-effective for large datasets |
| Better for single-document analysis | Better for vast knowledge bases |
RAG remains valuable even with larger contexts because it enables:
- Efficient handling of massive datasets
- Dynamic data updates
- Selective information retrieval
- Cost optimization
Overview of RAG (Retrieval-Augmented Generation)
Retrieve (relevant data) -> Augment (query w/ context) -> Generate (response)
| What is RAG? | How it Works | Benefits |
|---|---|---|
| System combining LLMs with data retrieval | Searches through your data sources | Access to unlimited external data |
| Framework for intelligent data access | Finds relevant information | Real-time data updates |
| Knowledge enhancement tool | Combines context with queries | Cost-effective scaling |
| Private data integration system | Generates accurate responses | Domain-specific expertise |
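To make the Retrieve -> Augment -> Generate flow concrete, here is a minimal, framework-agnostic sketch. The keyword-overlap retriever and the `call_llm` stub are illustrative placeholders, not a real retriever or LLM client.

```python
# Toy corpus standing in for an external knowledge base.
documents = [
    "The capital of France is Paris.",
    "Qwen2.5-Turbo supports a context window of about 1M tokens.",
]

def retrieve(query: str, docs: list[str], top_k: int = 1) -> list[str]:
    """Toy retriever: rank documents by the number of words shared with the query."""
    def score(doc: str) -> int:
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(docs, key=score, reverse=True)[:top_k]

def augment(query: str, context: list[str]) -> str:
    """Prepend the retrieved context to the user's question."""
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (OpenAI, a local model, etc.)."""
    return f"[LLM response to a {len(prompt)}-character prompt]"

query = "What is the capital of France?"
answer = call_llm(augment(query, retrieve(query, documents)))
print(answer)
```

In a real application the retriever is replaced by vector search (see below) and `call_llm` by an actual model client, but the three stages stay the same.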
Core Concept
The idea of RAG is to build an app on top of an LLM, creating an Intelligent Agent capable of selecting between multiple data sources and using actions as tools to retrieve information.
Development Pattern
Development usually follows this pattern:
- One-off queries (e.g., notebook or CLI)
- Ongoing chat
- Real-time responses with streaming
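As a sketch of the streaming pattern, LlamaIndex (for example) supports streaming query engines. The `./data` directory is a placeholder, and the default embedding model and LLM assume an OpenAI API key in the environment.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Build a small index over local files.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Streaming query engine: response tokens are printed as they are generated.
query_engine = index.as_query_engine(streaming=True)
streaming_response = query_engine.query("Summarize the key points of these documents.")
streaming_response.print_response_stream()
```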
Agentic RAG
Agentic RAG enables complex, multi-step reasoning and decision-making over data, unlike standard RAG, which handles simple, single-step queries.
Basics: there are Documents and a Router Engine
Document
- Stored in a vector store index (for Q&A over the most similar nodes)
- Also indexed in a summary index over its text (for summarization over all nodes)
Router Engine
- Routes to a Q&A query engine (vector index; retrieval behaviour A => fetches the most similar nodes)
- Routes to a summarization query engine (summary index; retrieval behaviour B => fetches all nodes)
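A minimal sketch of this setup using LlamaIndex's router pattern. The file path is a placeholder, and the default embedding model and LLM assume an OpenAI API key.

```python
from llama_index.core import SimpleDirectoryReader, SummaryIndex, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

# Load one document and split it into nodes ("./paper.pdf" is a placeholder).
documents = SimpleDirectoryReader(input_files=["./paper.pdf"]).load_data()
nodes = SentenceSplitter(chunk_size=1024).get_nodes_from_documents(documents)

# Two indexes over the same nodes: vector (similar nodes) and summary (all nodes).
vector_index = VectorStoreIndex(nodes)
summary_index = SummaryIndex(nodes)

vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_index.as_query_engine(),
    description="Useful for answering specific questions about the document.",
)
summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_index.as_query_engine(response_mode="tree_summarize"),
    description="Useful for summarizing the entire document.",
)

# The router's LLM selector picks the Q&A tool or the summarization tool per query.
router_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[vector_tool, summary_tool],
)
print(router_engine.query("Give me a high-level summary of the document."))
```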
#### Methods
- Routing: Decision-making to select appropriate tools
- Tool Use: Interface for agents to select and use tools with proper arguments
- Multi-step Reasoning: Using LLMs with various tools while maintaining memory
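A hedged sketch of tool use and multi-step reasoning with a LlamaIndex ReAct agent. The `word_count` function, the `./data` directory, and the tool names/descriptions are made-up examples, and the default models assume an OpenAI API key.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool, QueryEngineTool

def word_count(text: str) -> int:
    """Count the words in a piece of text."""
    return len(text.split())

# A plain Python function exposed as a tool.
count_tool = FunctionTool.from_defaults(fn=word_count)

# A RAG query engine over local documents exposed as a second tool.
index = VectorStoreIndex.from_documents(SimpleDirectoryReader("./data").load_data())
rag_tool = QueryEngineTool.from_defaults(
    query_engine=index.as_query_engine(),
    name="docs",
    description="Answers questions about the local document collection.",
)

# ReAct loop: reason -> choose a tool -> observe the result -> continue, with chat memory.
agent = ReActAgent.from_tools([count_tool, rag_tool], verbose=True)
print(agent.chat("What do the docs say about RAG? Also, how many words are in your answer?"))
```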
#### Use-cases
- Handles complex research queries across multiple documents
- Allows user intervention and guidance
- Provides debugging capabilities
- Enables detailed control and oversight
Commercial Applications of Agentic RAG
- Research & Analysis
  - Market research automation
  - Academic literature review
  - Competitive intelligence gathering
  - Patent analysis
- Legal & Compliance
  - Contract analysis
  - Legal document review
  - Regulatory compliance checking
  - Case law research
- Healthcare
  - Medical literature synthesis
  - Patient record analysis
  - Clinical trial data review
  - Treatment protocol research
- Financial Services
  - Investment research
  - Risk analysis reports
  - Financial document processing
  - Market trend analysis
- Customer Service
  - Complex query resolution
  - Technical support assistance
  - Product documentation analysis
  - Knowledge base management
Key Differences between LangChain and LlamaIndex for RAG with tools
LlamaIndex
- Specialized in data ingestion and indexing
- Better document-centric operations
- Simpler data connectors and indexing primitives
- More focused on RAG-specific optimizations
- Lighter weight, easier learning curve
LangChain
- Broader framework for general LLM applications
- More extensive tool/agent ecosystem
- Greater flexibility for complex chains
- Stronger focus on agent orchestration
- Larger community and more integrations
Choice Guidelines
a) Use LlamaIndex if primarily focused on document processing/RAG
b) Use LangChain if building complex multi-tool applications
c) Can use both together (LlamaIndex for indexing, LangChain for orchestration)
Vector Search: The Core Technology
Vector search is the solution to context window limitations:
- Convert text to numbers:
  - Data is encoded as meaning rather than keywords
  - The numbers are called vectors (embeddings of the text)
  - The vector space contains all your vectors (meanings)
  - The user prompt is also encoded into a vector (meaning)
- Process:
  a) Embed the data, embed the query
  b) Retrieve context from the vector space (similarity = nearby vectors)
  c) Feed the context + query to the LLM for a response
Note: All interactions with the LLM are in plain English text. RAG libraries maintain mappings between embeddings and original text.
Illustration
1. Data embedding and storage:
   - Text: "The capital of France is Paris."
   - Embedding: [0.23, -0.11, 0.45, ...]
   - Stored mapping: {[0.23, -0.11, 0.45, ...]: "The capital of France is Paris."}
2. Query embedding:
   - Query: "What is the capital of France?"
   - Embedding: [0.21, -0.10, 0.44, ...]
3. Retrieve context:
   - Search using the query embedding [0.21, -0.10, 0.44, ...]
   - Retrieve the closest stored embedding [0.23, -0.11, 0.45, ...]
4. Retrieve original text:
   - Look up the embedding [0.23, -0.11, 0.45, ...] in the stored mapping
   - Retrieve the text: "The capital of France is Paris."
5. Combine and query:
   - Context + query: "The capital of France is Paris. What is the capital of France?"
   - Send this combined text to the LLM
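The same walkthrough as a runnable toy sketch. The embedding values are invented for illustration; a real system would obtain them from an embedding model and keep them in a vector database rather than a Python dict.

```python
import numpy as np

# Invented 3-dimensional embeddings standing in for real embedding-model output.
store = {
    "The capital of France is Paris.": np.array([0.23, -0.11, 0.45]),
    "The capital of Japan is Tokyo.": np.array([0.05, 0.38, -0.20]),
}
query_text = "What is the capital of France?"
query_vec = np.array([0.21, -0.10, 0.44])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: values near 1.0 mean the vectors point the same way (similar meaning)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Retrieve the stored text whose embedding is closest to the query embedding.
best_text = max(store, key=lambda text: cosine(store[text], query_vec))

# Combine context and query; this augmented prompt is what gets sent to the LLM.
prompt = f"{best_text} {query_text}"
print(prompt)  # -> "The capital of France is Paris. What is the capital of France?"
```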
Production Readiness
To make the app production-ready:
- Persist data and load it again
- Implement near-live re-indexing
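A minimal persistence sketch with LlamaIndex; the directory names are placeholders.

```python
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

PERSIST_DIR = "./storage"  # placeholder directory

# First run: build the index from raw documents and persist it to disk.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist(persist_dir=PERSIST_DIR)

# Subsequent runs: reload the persisted index instead of re-embedding everything.
storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
index = load_index_from_storage(storage_context)
```

For near-live re-indexing, recent LlamaIndex versions also expose `index.refresh_ref_docs(documents)`, which re-embeds only the documents whose content has changed.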
Building RAG Applications
Procedure:
- Load data + create index
- Create a Query Engine:
  - Retriever: fetch relevant context
  - Postprocessing: process retrieved content
  - Synthesizer: combine context with query for the LLM
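A sketch of this procedure with LlamaIndex, wiring the retriever, postprocessor, and synthesizer explicitly. The `./data` path, `similarity_top_k`, and the similarity cutoff are illustrative values, and the default embedding model and LLM assume an OpenAI API key.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, get_response_synthesizer
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever

# 1) Load data + create index.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# 2) Create a query engine from its three parts.
retriever = VectorIndexRetriever(index=index, similarity_top_k=5)   # fetch relevant context
postprocessor = SimilarityPostprocessor(similarity_cutoff=0.7)      # drop weak matches
synthesizer = get_response_synthesizer()                            # combine context + query for the LLM

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=synthesizer,
    node_postprocessors=[postprocessor],
)
print(query_engine.query("What does this collection say about vector search?"))
```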
Requirements:
- RAG framework (LangChain, LlamaIndex, etc.)
- Model components:
  - LLM options:
    - OpenAI GPT-4 (128K context)
    - Qwen2.5-Turbo (1M context) - Alibaba's model
  - Embedding model for vector search