RAG

Retrieval-Augmented Generation: Enhancing LLMs with External Knowledge

Context windows in LLMs represent the amount of text they can process in a single interaction:

  • 1M tokens ≈ 750 pages of text (current largest: Qwen2.5-Turbo)
  • 200K tokens ≈ 150 pages of text (previous standard)

While larger context windows might seem to reduce the need for RAG, the two serve different purposes:

| Large Context Windows | RAG |
| --- | --- |
| Limited to window size | Can access unlimited external data |
| All data must be sent in the prompt | Selective retrieval of relevant data |
| Higher API costs for large contexts | More cost-effective for large datasets |
| Better for single-document analysis | Better for vast knowledge bases |

RAG remains valuable even with larger contexts because it enables:

  • Efficient handling of massive datasets
  • Dynamic data updates
  • Selective information retrieval
  • Cost optimization

Overview of RAG (Retrieval-Augmented Generation)

Retrieve (relevant data) -> Augment (query w/ context) -> Generate (response)

| What is RAG? | How it Works | Benefits |
| --- | --- | --- |
| System combining LLMs with data retrieval | Searches through your data sources | Access to unlimited external data |
| Framework for intelligent data access | Finds relevant information | Real-time data updates |
| Knowledge enhancement tool | Combines context with queries | Cost-effective scaling |
| Private data integration system | Generates accurate responses | Domain-specific expertise |

Core Concept

The idea behind RAG is to build an application on top of an LLM, creating an intelligent agent capable of selecting between multiple data sources and using actions as tools to retrieve information.

Development Pattern

Development usually follows this pattern:

  1. One-off queries (e.g., notebook or CLI)
  2. Ongoing chat
  3. Real-time responses with streaming
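
As a rough illustration, here is what those three stages might look like with LlamaIndex. The "data" directory, the questions, and the default OpenAI-backed models (which need an API key) are assumptions, not part of the original notes:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Index some local documents ("data" is an assumed directory of files).
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# 1. One-off query (e.g., from a notebook or CLI).
print(index.as_query_engine().query("What is this document about?"))

# 2. Ongoing chat: the chat engine keeps conversation history.
chat_engine = index.as_chat_engine()
print(chat_engine.chat("Summarize the key points."))

# 3. Streaming: print the response token by token as it is generated.
streaming = index.as_query_engine(streaming=True).query("Summarize the document.")
for token in streaming.response_gen:
    print(token, end="")
```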

Agentic RAG

Unlike standard RAG, which handles simple, single-step queries, Agentic RAG enables complex, multi-step reasoning and decision-making over data.

Basics: Documents and a Router Engine

Document
  • Indexed in a vector store
  • Also indexed in a summary index
Router Engine
  1. Routes Q&A queries to the vector-index query engine (retrieval behaviour A => similar nodes)
  2. Routes summarization queries to the summary-index query engine (retrieval behaviour B => all nodes)
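
As a concrete sketch, LlamaIndex's RouterQueryEngine implements exactly this pattern; the file name and chunk size below are placeholder assumptions:

```python
from llama_index.core import SimpleDirectoryReader, SummaryIndex, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

# Load one document and split it into nodes ("paper.pdf" is a placeholder).
documents = SimpleDirectoryReader(input_files=["paper.pdf"]).load_data()
nodes = SentenceSplitter(chunk_size=1024).get_nodes_from_documents(documents)

# Two indexes over the same nodes: behaviour A (similar nodes) vs. B (all nodes).
vector_index = VectorStoreIndex(nodes)
summary_index = SummaryIndex(nodes)

vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_index.as_query_engine(),
    description="Useful for answering specific questions about the document.",
)
summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_index.as_query_engine(response_mode="tree_summarize"),
    description="Useful for summarizing the whole document.",
)

# The router lets the LLM pick the right engine for each query.
query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[vector_tool, summary_tool],
)
print(query_engine.query("Give me a summary of the document."))
```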

Methods

  1. Routing: Decision-making to select appropriate tools
  2. Tool Use: Interface for agents to select and use tools with proper arguments
  3. Multi-step Reasoning: Using LLMs with various tools while maintaining memory
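
For tool use, a minimal sketch with LlamaIndex's FunctionTool, assuming the llama-index-llms-openai package and an OpenAI API key; the multiply function is a hypothetical example tool:

```python
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI

def multiply(a: float, b: float) -> float:
    """Multiply two numbers and return the product."""
    return a * b

# Wrap a plain Python function as a tool; the name and argument schema
# are derived from the function signature and docstring.
multiply_tool = FunctionTool.from_defaults(fn=multiply)

# The LLM selects the tool and fills in its arguments from the question.
llm = OpenAI(model="gpt-4o-mini")
print(llm.predict_and_call([multiply_tool], "What is 21 times 2?"))
```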

Use-cases

  • Handles complex research queries across multiple documents
  • Allows user intervention and guidance
  • Provides debugging capabilities
  • Enables detailed control and oversight

Commercial Applications of Agentic RAG

  • Research & Analysis

    • Market research automation
    • Academic literature review
    • Competitive intelligence gathering
    • Patent analysis
  • Legal & Compliance

    • Contract analysis
    • Legal document review
    • Regulatory compliance checking
    • Case law research
  • Healthcare

    • Medical literature synthesis
    • Patient record analysis
    • Clinical trial data review
    • Treatment protocol research
  • Financial Services

    • Investment research
    • Risk analysis reports
    • Financial document processing
    • Market trend analysis
  • Customer Service

    • Complex query resolution
    • Technical support assistance
    • Product documentation analysis
    • Knowledge base management

Key Differences between LangChain and LlamaIndex for RAG with tools

LlamaIndex

  • Specialized in data ingestion and indexing
  • Better document-centric operations
  • Simpler data connectors and indexing primitives
  • More focused on RAG-specific optimizations
  • Lighter weight, easier learning curve

LangChain

  • Broader framework for general LLM applications
  • More extensive tool/agent ecosystem
  • Greater flexibility for complex chains
  • Stronger focus on agent orchestration
  • Larger community and more integrations

Choice Guidelines

a) Use LlamaIndex if primarily focused on document processing/RAG
b) Use LangChain if building complex multi-tool applications
c) Use both together (LlamaIndex for indexing, LangChain for orchestration)

Vector Search: The Core Technology

Vector search is the core technique that lets RAG work around context-window limits:

  1. Convert text to numbers:

    • Data is encoded by meaning rather than by keywords
    • These numbers are called vectors (embeddings of the text)
    • The vector space contains all your vectors (meanings)
    • The user prompt is also encoded into a vector (meaning)
  2. Process:

    a) Embed the data and embed the query
    b) Retrieve context from the vector space (nearest neighbours by similarity)
    c) Feed the context + query to the LLM for a response

Note: All interactions with the LLM are in plain English text. RAG libraries maintain mappings between embeddings and original text.

Illustration

Data Embedding and Storage:
Text: "The capital of France is Paris."
Embedding: [0.23, -0.11, 0.45, ...]
Stored Mapping: {[0.23, -0.11, 0.45, ...]: "The capital of France is Paris."}

Query Embedding:
Query: "What is the capital of France?"
Embedding: [0.21, -0.10, 0.44, ...]

Retrieve Context:
Search using query embedding [0.21, -0.10, 0.44, ...]
Retrieve closest embedding [0.23, -0.11, 0.45, ...]

Retrieve Original Text:
Lookup embedding [0.23, -0.11, 0.45, ...] in stored mapping
Retrieve text: "The capital of France is Paris."

Combine and Query:
Context + Query: "The capital of France is Paris. What is the capital of France?"
Send this combined text to the LLM
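
A toy, dependency-light version of this flow, with the example vectors hard-coded in place of a real embedding model:

```python
import numpy as np

# Toy vector store: embedding -> original text. A real embedding model
# (e.g., OpenAI's text-embedding-3-small) would produce these vectors;
# here they are the hard-coded example values from the illustration above.
store = {
    (0.23, -0.11, 0.45): "The capital of France is Paris.",
}

def cosine_similarity(a, b) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "Embedding" of the query "What is the capital of France?"
query_vec = (0.21, -0.10, 0.44)

# Retrieve: find the stored embedding closest to the query embedding.
best = max(store, key=lambda v: cosine_similarity(v, query_vec))
context = store[best]

# Augment: combine the retrieved text with the query, then send to the LLM.
prompt = f"{context} What is the capital of France?"
print(prompt)
```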

Production Readiness

To make the app production-ready:

  • Persist the index and reload it on restart (avoid re-embedding on every run)
  • Implement near-live re-indexing as source data changes
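
With LlamaIndex, persistence is a few lines; the ./storage and data paths are assumptions:

```python
import os

from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

PERSIST_DIR = "./storage"

if os.path.exists(PERSIST_DIR):
    # Load the previously built index instead of re-embedding everything.
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)
else:
    # First run: build the index and persist it to disk.
    documents = SimpleDirectoryReader("data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
```

For near-live re-indexing, LlamaIndex also supports inserting new documents into an existing index (index.insert(...)) rather than rebuilding it from scratch.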

Building RAG Applications

Procedure:

  1. Load data + create index
  2. Create a Query Engine:
    • Retriever: fetch relevant context
    • Postprocessing: process retrieved content
    • Synthesizer: combine context with query for LLM
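
These three components map directly onto LlamaIndex's lower-level query-engine API; a sketch, where the paths, top-k, and similarity cutoff are assumed values:

```python
from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    get_response_synthesizer,
)
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever

# Load data and create the index ("data" is an assumed directory).
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# 1. Retriever: fetch the top-k most similar nodes for a query.
retriever = VectorIndexRetriever(index=index, similarity_top_k=5)

# 2. Postprocessor: drop weakly matching nodes below a similarity cutoff.
postprocessor = SimilarityPostprocessor(similarity_cutoff=0.7)

# 3. Synthesizer: combine the retrieved context with the query for the LLM.
synthesizer = get_response_synthesizer()

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=synthesizer,
    node_postprocessors=[postprocessor],
)
print(query_engine.query("What does the document say about pricing?"))
```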

Requirements:

  • RAG framework (LangChain, LlamaIndex, etc.)
  • Model components:
    • LLM options:
      • OpenAI GPT-4 Turbo (128K context)
      • Qwen2.5-Turbo (1M context) - Alibaba's model
    • Embedding model for vector search
