Retrieval-Augmented Generation (RAG) Guide
Table of Contents
- Overview
- Core Components
- Implementation Steps
- Advanced Features
- Best Practices
- Common Pitfalls to Avoid
- Performance Optimization
Overview
Retrieval-Augmented Generation (RAG) is a technique that enhances Large Language Models (LLMs) by grounding them in an external knowledge base: documents relevant to a query are retrieved at generation time and supplied to the model as context. This gives the model access to information beyond its training data, improving factual accuracy and reducing hallucinations.
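Conceptually, every RAG system is a retrieve-then-generate loop. Here is a minimal sketch of that loop, assuming hypothetical retrieve() and llm() callables, just to fix the shape before the detailed implementations below:

def rag_answer(query: str) -> str:
    # 1. Retrieve: find the documents most relevant to the query
    contexts = retrieve(query, k=3)
    # 2. Augment: pack the retrieved text into the prompt
    context_block = "\n".join(contexts)
    prompt = f"Context:\n{context_block}\n\nQuestion: {query}\nAnswer:"
    # 3. Generate: the LLM answers grounded in the retrieved context
    return llm(prompt)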
Core Components
- Document Processing Pipeline
  - Document loading
  - Text chunking
  - Vector embedding generation
  - Vector store integration
- Retrieval System
  - Semantic search
  - Relevance ranking
  - Context window management
- Generation System
  - Prompt engineering
  - LLM integration
  - Response synthesis
Implementation Steps
1. Basic Implementation
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

# 1. Load documents
loader = TextLoader("path/to/your/doc.txt")
documents = loader.load()

# 2. Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
texts = text_splitter.split_documents(documents)

# 3. Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(texts, embeddings)

# 4. Create retrieval chain
llm = OpenAI()
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

# 5. Query the system
query = "What is the main topic of the document?"
response = qa_chain.run(query)
2. Advanced Implementation with Custom Components
import numpy as np
from typing import Dict, List

class CustomRAG:
    def __init__(self, embedding_model, llm_model, vector_store):
        self.embedding_model = embedding_model
        self.llm_model = llm_model
        self.vector_store = vector_store

    def process_document(self, text: str, chunk_size: int = 1000) -> List[Dict]:
        """Split a document into chunks and store their embeddings."""
        chunks = self._split_text(text, chunk_size)
        embeddings = [self.embedding_model.embed(chunk) for chunk in chunks]
        return self._store_embeddings(chunks, embeddings)

    def retrieve(self, query: str, k: int = 3) -> List[str]:
        """Retrieve the k most relevant contexts for a query."""
        query_embedding = self.embedding_model.embed(query)
        return self._similarity_search(query_embedding, k)

    def generate_response(self, query: str, contexts: List[str]) -> str:
        """Generate an LLM response grounded in the retrieved contexts."""
        prompt = self._create_prompt(query, contexts)
        return self.llm_model.generate(prompt)

    def _split_text(self, text: str, chunk_size: int) -> List[str]:
        """Naive fixed-size chunking; swap in a smarter splitter as needed."""
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    def _store_embeddings(self, chunks: List[str], embeddings: List[np.ndarray]) -> List[Dict]:
        """Persist (chunk, embedding) records; assumes the store exposes add()."""
        records = [{"text": c, "embedding": e} for c, e in zip(chunks, embeddings)]
        self.vector_store.add(records)
        return records

    def _similarity_search(self, query_embedding: np.ndarray, k: int) -> List[str]:
        """Delegate vector similarity search to the underlying store."""
        return self.vector_store.search(query_embedding, k)

    def _create_prompt(self, query: str, contexts: List[str]) -> str:
        """Assemble the prompt from retrieved contexts and the query."""
        context_str = "\n".join(contexts)
        return f"""
Context: {context_str}
Question: {query}
Answer based on the context provided:"""
# Usage example (YourEmbeddingModel, YourLLMModel and YourVectorStore are placeholders)
rag_system = CustomRAG(
    embedding_model=YourEmbeddingModel(),
    llm_model=YourLLMModel(),
    vector_store=YourVectorStore()
)

# Process documents
rag_system.process_document("Your document text here")

# Query the system
query = "Your question here"
contexts = rag_system.retrieve(query)
response = rag_system.generate_response(query, contexts)
Advanced Features
1. Hybrid Search
Combine semantic and keyword search for better retrieval:
def hybrid_search(self, query: str, k: int = 3):
    # semantic_search and keyword_search are the two first-stage retrievers
    semantic_results = self.semantic_search(query, k)
    keyword_results = self.keyword_search(query, k)
    return self._merge_results(semantic_results, keyword_results)
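The snippet above leaves _merge_results undefined. A minimal sketch, assuming each result list is ordered best-first, is to merge by reciprocal rank so that documents ranked highly by either retriever surface first:

from collections import defaultdict
from typing import List

def _merge_results(self, semantic_results: List[str], keyword_results: List[str], k: int = 60) -> List[str]:
    """Merge two best-first result lists with reciprocal rank fusion."""
    scores = defaultdict(float)
    for results in (semantic_results, keyword_results):
        for rank, doc in enumerate(results):
            scores[doc] += 1.0 / (rank + k)  # earlier rank -> larger contribution
    return [doc for doc, _ in sorted(scores.items(), key=lambda x: x[1], reverse=True)]

This is the same reciprocal rank fusion idea used by the RAGFusion class later in this guide.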
2. Re-ranking
Implement a re-ranking step to improve retrieval quality:
def rerank_results(self, query: str, initial_results: List[str], k: int = 3):
    scores = []
    for result in initial_results:
        relevance_score = self.compute_relevance(query, result)
        scores.append((result, relevance_score))
    # Return the k best candidates as (result, score) pairs, highest score first
    return sorted(scores, key=lambda x: x[1], reverse=True)[:k]
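compute_relevance is left abstract above. A common concrete choice is a cross-encoder that scores the query and candidate jointly; a minimal sketch using the sentence-transformers library (the checkpoint name is just an example):

from sentence_transformers import CrossEncoder

# Example checkpoint trained for passage re-ranking
_reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def compute_relevance(self, query: str, result: str) -> float:
    """Score (query, passage) relevance jointly with a cross-encoder."""
    return float(_reranker.predict([(query, result)])[0])

Because every candidate needs a full forward pass, cross-encoders are usually applied only to a small pool of first-stage results.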
Best Practices
- Document Processing
  - Use appropriate chunk sizes (typically 500-1000 tokens)
  - Maintain context across chunk boundaries with overlaps
  - Clean and preprocess text properly
- Retrieval Optimization
  - Cache results for frequent queries (see the sketch after this list)
  - Use hybrid search approaches
  - Consider metadata filtering
- Prompt Engineering
  - Include clear instructions
  - Stay within context window limits
  - Structure prompts consistently
- System Monitoring
  - Track retrieval quality metrics
  - Monitor response latency
  - Log user feedback
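As a concrete example of the caching point above, a minimal sketch is to memoize retrieval results keyed by the query string, assuming a deterministic retrieve() function (hypothetical, as in the Overview sketch):

from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_retrieve(query: str) -> tuple:
    """Memoized retrieval; returns a tuple because lru_cache needs hashable values."""
    return tuple(retrieve(query, k=3))

In production you would likely key on a normalized query (lowercased, stripped) and add a TTL, since the underlying index changes over time.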
Common Pitfalls to Avoid
- Chunk sizes that are too large or too small
- Insufficient overlap between chunks
- Poor prompt engineering
- Ignoring edge cases
- Not handling API rate limits
Performance Optimization
- Batch Processing
  - Process documents in batches
  - Use parallel processing where possible (see the sketch after this list)
  - Implement caching mechanisms
- Vector Store Optimization
  - Choose appropriate index types
  - Implement regular index maintenance
  - Consider scaling strategies
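One way to realize the batch-processing advice, sketched with the standard library only and a hypothetical per-batch embed() call:

from concurrent.futures import ThreadPoolExecutor
from typing import List

def embed_documents(docs: List[str], batch_size: int = 32, workers: int = 4) -> List[list]:
    """Embed documents in parallel batches; embed() is an assumed embedding call."""
    batches = [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves batch order, so the flattened result lines up with docs
        results = pool.map(embed, batches)
    return [vec for batch in results for vec in batch]

Threads work well here because embedding calls to a hosted API are I/O-bound; for local models, prefer the model's native batching.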
RAG Systems: Theoretical Foundations and Advanced Concepts
1. Theoretical Foundations
1.1 Information Retrieval Theory
- Vector Space Model (VSM)
  - Documents and queries are represented as vectors in a high-dimensional space
  - Similarity is measured by cosine similarity: cos(θ) = (d⋅q)/(||d||⋅||q||) (see the sketch after this list)
  - Term frequency-inverse document frequency (TF-IDF) weighting
  - Limited semantic understanding
- Probabilistic Retrieval Models
  - Binary Independence Model (BIM)
  - Probability Ranking Principle (PRP)
  - Language model approaches
  - Bayesian networks for document retrieval
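To make the VSM concrete, here is a small sketch using scikit-learn's TfidfVectorizer and a hand-rolled cosine similarity:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "dogs chase cats", "stock markets fell today"]
query = "a cat on a mat"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs).toarray()     # one TF-IDF vector per document
query_vector = vectorizer.transform([query]).toarray()[0]  # query in the same space

def cosine(d: np.ndarray, q: np.ndarray) -> float:
    """cos(θ) = (d⋅q) / (||d||⋅||q||), as in the formula above."""
    return float(d @ q / (np.linalg.norm(d) * np.linalg.norm(q) + 1e-12))

scores = [cosine(d, query_vector) for d in doc_vectors]    # highest score = best match

This also illustrates the limitation named above: "cats" and "cat" are different terms to a purely lexical vectorizer, so semantically related documents can score poorly.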
1.2 Neural Information Retrieval
- Dense Retrieval (a bi-encoder sketch follows this list)
  - Bi-encoder architectures
  - Cross-encoder architectures
  - Late interaction models
  - Early interaction models
- Semantic Matching Theory
  - Lexical matching vs. semantic matching
  - Word embedding spaces
  - Contextual embeddings
  - Cross-attention mechanisms
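A minimal bi-encoder sketch using the sentence-transformers library (the model name is an example): queries and documents are embedded independently, so document vectors can be precomputed and indexed, unlike with a cross-encoder:

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # example bi-encoder checkpoint

docs = ["RAG combines retrieval with generation.", "FAISS is a vector index."]
doc_embeddings = encoder.encode(docs)                    # precomputable, index offline
query_embedding = encoder.encode("What is RAG?")         # computed at query time

scores = util.cos_sim(query_embedding, doc_embeddings)   # 1 x len(docs) similarity matrix
best = int(scores.argmax())

The bi-encoder's independence is exactly what enables approximate nearest-neighbor search at scale; the cross-encoder shown in the re-ranking section trades that scalability for accuracy.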
2. RAG Architecture Theory
2.1 Component Integration Theory
- Retrieval-Generation Interface (a small worked example follows this list)
  P(y|x) = ∑_z P(y|z,x)⋅P(z|x)
  where:
  - y is the generated response
  - x is the input query
  - z ranges over retrieved documents
- Information Flow Models
  I(X;Y,Z) = I(X;Z) + I(X;Y|Z)
  where:
  - I(X;Y,Z) is the total mutual information between the query and the retrieval-generation pipeline
  - I(X;Z) is the retrieval information
  - I(X;Y|Z) is the generation information
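A tiny worked example of the marginalization above, with hypothetical numbers for three retrieved documents:

import numpy as np

# P(z|x): retriever's distribution over 3 candidate documents for query x
p_z_given_x = np.array([0.6, 0.3, 0.1])
# P(y|z,x): probability the generator produces answer y given each document
p_y_given_zx = np.array([0.9, 0.4, 0.1])

# P(y|x) = sum_z P(y|z,x) * P(z|x)
p_y_given_x = float(p_y_given_zx @ p_z_given_x)  # 0.9*0.6 + 0.4*0.3 + 0.1*0.1 = 0.67

In practice the sum runs over the top-k retrieved documents rather than the whole corpus, which keeps the marginal tractable.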
2.2 Theoretical Optimization Framework
- Joint Optimization Problem
  argmax_{θ_r,θ_g} E[log P_θ_g(y|z,x) + λ log P_θ_r(z|x)]
  where:
  - θ_r are retrieval parameters
  - θ_g are generation parameters
  - λ is a balancing factor
3. Advanced Theoretical Concepts
3.1 Multi-Hop Reasoning Theory
- Graph-based Knowledge Propagation
  h_i^(l+1) = σ(∑_j α_ij W h_j^(l))
  where:
  - h_i is the representation of node i
  - α_ij is the attention weight between nodes i and j
  - W is a learnable parameter matrix
- Chain-of-Thought Integration
  P(y|x) = ∑_{z_1…z_n} P(y|z_n,x)⋅∏_i P(z_i|z_{i-1},x)
3.2 Uncertainty Quantification
- Bayesian RAG Framework
  P(y|x) = ∫∫ P(y|z,θ)⋅P(z|x,φ)⋅P(θ)⋅P(φ) dθ dφ
  where:
  - θ are generation parameters
  - φ are retrieval parameters
- Confidence Estimation (entropy-based, so lower values mean higher confidence; see the sketch after this list)
  conf(y) = -∑_i P(y_i|x,z) log P(y_i|x,z)
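A numpy sketch of the entropy-based estimate above, with hypothetical next-token distributions:

import numpy as np

def token_entropy(token_probs: np.ndarray) -> float:
    """Shannon entropy -sum_i p_i log p_i of one token's output distribution."""
    p = token_probs[token_probs > 0]          # avoid log(0)
    return float(-(p * np.log(p)).sum())

confident = np.array([0.97, 0.01, 0.01, 0.01])   # low entropy -> high confidence
uncertain = np.array([0.25, 0.25, 0.25, 0.25])   # maximal entropy -> low confidence
scores = [token_entropy(confident), token_entropy(uncertain)]

Averaging this entropy over the generated tokens gives a simple sequence-level uncertainty signal.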
4. Theoretical Challenges and Solutions
4.1 Context Window Management
- Information Density Theory
  ID(z) = -log P(z|x) / len(z)
  where ID(z) is the information density of context z
- Optimal Context Selection
  z* = argmax_z [I(y;z|x) - λ|z|]
  where |z| is the context length
4.2 Knowledge Integration
- Knowledge Fusion Models
  K(x) = α⋅K_retrieved(x) + (1-α)⋅K_parametric(x)
  where:
  - K_retrieved is retrieved knowledge
  - K_parametric is the model's parametric knowledge
  - α is a dynamic weighting factor
5. Advanced Retrieval Theory
5.1 Cross-Attention Retrieval
score(q,d) = attention(BERT(q), BERT(d))
5.2 Hybrid Retrieval Models
score_final = λ⋅score_dense + (1-λ)⋅score_sparse
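A short sketch of the dense/sparse interpolation above. Since dense and sparse scores live on different scales, min-max normalizing each list before mixing (an assumption, not part of the formula) keeps λ interpretable:

import numpy as np

def normalize(scores: np.ndarray) -> np.ndarray:
    """Min-max normalize scores to [0, 1]."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + 1e-12)

def hybrid_scores(dense: np.ndarray, sparse: np.ndarray, lam: float = 0.5) -> np.ndarray:
    """score_final = λ·score_dense + (1-λ)·score_sparse over aligned candidates."""
    return lam * normalize(dense) + (1 - lam) * normalize(sparse)

# Hypothetical scores for the same 4 candidate documents
dense = np.array([0.82, 0.75, 0.40, 0.10])    # e.g. cosine similarities
sparse = np.array([12.1, 3.4, 9.8, 0.5])      # e.g. BM25 scores
ranking = np.argsort(-hybrid_scores(dense, sparse))  # best candidates first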
5.3 Learning to Rank Theory
- Pairwise Ranking Loss (see the sketch after this list)
  L = max(0, margin - score_pos + score_neg)
- ListNet Ranking
  L = -∑_i P_i log(Q_i)
  where:
  - P_i is the ground-truth relevance probability
  - Q_i is the predicted probability
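The pairwise loss above corresponds directly to PyTorch's MarginRankingLoss; a minimal sketch with hypothetical scores:

import torch
import torch.nn as nn

loss_fn = nn.MarginRankingLoss(margin=1.0)

score_pos = torch.tensor([2.5, 0.3])   # scores for relevant documents
score_neg = torch.tensor([1.0, 0.8])   # scores for irrelevant documents
target = torch.ones(2)                 # +1 means "first input should rank higher"

# loss = mean over pairs of max(0, margin - (score_pos - score_neg))
loss = loss_fn(score_pos, score_neg, target)

When score_pos already beats score_neg by more than the margin, that pair contributes zero loss, so training focuses on the pairs the ranker still gets wrong.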
6. Generation Theory
6.1 Constrained Generation
- Constraint Satisfaction
  P_c(y|x,z) ∝ P(y|x,z)⋅∏_i I[C_i(y)]
  where the C_i are constraints and I[·] is the indicator function
6.2 Fusion Mechanisms
- Soft Fusion
  h_fused = σ(W_q⋅q + W_d⋅d + W_c⋅c)
  where:
  - q is the query representation
  - d is the document representation
  - c is the cross-attention features
7. Theoretical Metrics and Evaluation
7.1 Information-Theoretic Metrics
- Retrieval Quality
  RQ = I(Z;Y|X) / H(Y|X)
  where:
  - I is mutual information
  - H is entropy
7.2 Generation Quality
- Faithfulness Metric
  Faith(y,z) = min_i sim(y_i, z)
  where sim is a semantic similarity function and the y_i are segments of the response
8. Future Theoretical Directions
8.1 Continuous Learning
- Knowledge Update Theory
  K_{t+1} = η⋅K_new + (1-η)⋅K_t
  where:
  - K_t is the current knowledge state
  - K_new is the newly acquired knowledge
  - η is the learning rate
8.2 Multi-Modal RAG
- Cross-Modal Attention
  A(q,v,t) = softmax(q^T[v;t])
  where:
  - q is the query
  - v are the visual features
  - t are the textual features
Advanced RAG Implementation Guide - Extended Features
Table of Contents
- Alternative Architectures
- Query Transformation
- Advanced Context Processing
- Evaluation Framework
- Security and Privacy Features
Alternative Architectures
1. Multi-Stage RAG
class MultiStageRAG:
    def __init__(self, embedding_model, llm_model):
        self.embedding_model = embedding_model
        self.llm_model = llm_model

    def process_query(self, query: str):
        # Stage 1: Query understanding
        expanded_query = self.expand_query(query)
        # Stage 2: Coarse retrieval over the whole corpus
        candidate_docs = self.coarse_retrieval(expanded_query)
        # Stage 3: Fine-grained retrieval within the candidate set
        relevant_chunks = self.fine_retrieval(candidate_docs, query)
        # Stage 4: Generation with progressive refinement
        initial_response = self.generate_initial_response(relevant_chunks, query)
        final_response = self.refine_response(initial_response, query)
        return final_response

    def expand_query(self, query: str):
        prompt = f"""Generate multiple search queries to find relevant information for: {query}
Include different phrasings and related concepts."""
        return self.llm_model.generate(prompt)

    # coarse_retrieval, fine_retrieval, generate_initial_response and
    # refine_response are stage hooks to be supplied by the concrete system.
2. Fusion Techniques
from collections import defaultdict

class RAGFusion:
    def __init__(self):
        self.retrievers = []

    def add_retriever(self, retriever, weight=1.0):
        self.retrievers.append((retriever, weight))

    def reciprocal_rank_fusion(self, rankings_list, k=60):
        """Fuse several best-first rankings with reciprocal rank fusion."""
        fused_scores = defaultdict(float)
        for rankings in rankings_list:
            for rank, doc_id in enumerate(rankings):
                fused_scores[doc_id] += 1.0 / (rank + k)
        return sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
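For instance, fusing two hypothetical rankings of the same document ids:

fusion = RAGFusion()
ranked_by_dense = ["doc3", "doc1", "doc2"]
ranked_by_bm25 = ["doc1", "doc3", "doc4"]
fused = fusion.reciprocal_rank_fusion([ranked_by_dense, ranked_by_bm25])
# doc1 and doc3 appear near the top of both lists, so they lead the fused ranking

The constant k=60 comes from the original reciprocal rank fusion paper and damps the influence of any single top rank.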
Query Transformation
1. Query Decomposition
from typing import List

class QueryDecomposer:
    def __init__(self, llm):
        self.llm = llm

    def decompose_query(self, complex_query: str) -> List[str]:
        prompt = f"""Break down this complex query into simpler sub-queries:
Query: {complex_query}
Return as a list of distinct questions that together help answer the main query."""
        sub_queries = self.llm.generate(prompt)
        # parse_sub_queries turns the LLM's free-text list into List[str]
        return self.parse_sub_queries(sub_queries)

    def execute_sub_queries(self, sub_queries: List[str], retriever):
        results = []
        for query in sub_queries:
            docs = retriever.get_relevant_documents(query)
            results.append((query, docs))
        # synthesize_results combines per-sub-query evidence into one answer
        return self.synthesize_results(results)
2. Hypothetical Document Embeddings (HyDE)
class HyDERetriever:
    def __init__(self, llm, embedding_model, vectorstore):
        self.llm = llm
        self.embedding_model = embedding_model
        self.vectorstore = vectorstore

    def generate_hypothetical_doc(self, query: str) -> str:
        prompt = f"""Generate a hypothetical document passage that would perfectly answer this query:
Query: {query}
Passage:"""
        return self.llm.generate(prompt)

    def retrieve(self, query: str, k: int = 3):
        # Generate a hypothetical answer document
        hyde_doc = self.generate_hypothetical_doc(query)
        # Embed the hypothetical document instead of the raw query
        hyde_embedding = self.embedding_model.embed(hyde_doc)
        # Retrieve real documents that are similar to the hypothetical one
        return self.vectorstore.similarity_search_by_vector(hyde_embedding, k)
Advanced Context Processing
1. Dynamic Context Window
from typing import List

class DynamicContextWindow:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens

    def optimize_context(self, retrieved_docs: List[str], query: str) -> str:
        # Score each document's relevance to the query
        scored_docs = [(doc, self.score_relevance(doc, query))
                       for doc in retrieved_docs]
        # Sort by relevance, best first
        sorted_docs = sorted(scored_docs, key=lambda x: x[1], reverse=True)
        # Greedily build the context within the token budget
        context = []
        current_tokens = 0
        for doc, score in sorted_docs:
            doc_tokens = len(self.tokenize(doc))
            if current_tokens + doc_tokens <= self.max_tokens:
                context.append(doc)
                current_tokens += doc_tokens
            else:
                break
        return self.format_context(context)

    # score_relevance, tokenize and format_context are hooks for the
    # concrete scorer, tokenizer and prompt formatter in use.
2. Contextual Compression
from typing import List

class ContextualCompressor:
    def __init__(self, llm):
        self.llm = llm

    def compress_contexts(self, contexts: List[str], query: str) -> str:
        """Compress retrieved contexts while preserving query-relevant information."""
        prompt = f"""
Query: {query}
Contexts:
{self.format_contexts(contexts)}
Create a compressed version that maintains all information relevant to the query."""
        return self.llm.generate(prompt)

    def format_contexts(self, contexts: List[str]) -> str:
        return "\n\n---\n\n".join(contexts)
Evaluation Framework
1. RAG Metrics Implementation
from typing import List

class RAGEvaluator:
    def __init__(self):
        self.metrics = {}

    def evaluate_faithfulness(self, response: str, contexts: List[str]) -> float:
        """Is the response supported by the retrieved contexts? (sketch below)"""
        pass  # faithfulness scoring goes here

    def evaluate_relevance(self, retrieved_docs: List[str], query: str) -> float:
        """How relevant are the retrieved documents to the query?"""
        pass  # relevance scoring goes here

    def evaluate_answer_correctness(self, generated_answer: str,
                                    ground_truth: str) -> float:
        """How close is the generated answer to the ground truth?"""
        pass  # answer-correctness scoring goes here

    def evaluate_retrieval_diversity(self, retrieved_docs: List[str]) -> float:
        """How diverse is the set of retrieved documents?"""
        pass  # diversity scoring goes here
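As one way to fill in evaluate_faithfulness, here is a sketch that scores each response sentence by its best embedding similarity to any context and takes the minimum, mirroring the Faith(y,z) = min_i sim(y_i, z) metric from the theory section (the bi-encoder checkpoint is an example):

from typing import List
from sentence_transformers import SentenceTransformer, util

_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def evaluate_faithfulness(self, response: str, contexts: List[str]) -> float:
    """Minimum over response sentences of the best similarity to any context."""
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    sent_emb = _encoder.encode(sentences)
    ctx_emb = _encoder.encode(contexts)
    sims = util.cos_sim(sent_emb, ctx_emb)       # sentences x contexts matrix
    per_sentence_best = sims.max(dim=1).values   # best-supporting context per sentence
    return float(per_sentence_best.min())        # weakest-supported sentence dominates

Splitting on "." is a deliberate simplification; a real implementation would use a proper sentence segmenter.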
2. A/B Testing Framework
from collections import defaultdict
from typing import List

class RAGABTesting:
    def __init__(self, variant_a, variant_b):
        self.variant_a = variant_a
        self.variant_b = variant_b
        self.results = defaultdict(list)

    def run_test(self, test_queries: List[str], metrics: List[str]):
        for query in test_queries:
            # Run both variants on the same query
            result_a = self.variant_a.process_query(query)
            result_b = self.variant_b.process_query(query)
            # Score each variant on every metric
            for metric in metrics:
                score_a = self.evaluate_metric(result_a, metric)
                score_b = self.evaluate_metric(result_b, metric)
                self.results[metric].append({
                    'variant_a': score_a,
                    'variant_b': score_b
                })

    def analyze_results(self):
        """Statistical analysis of A/B test results (see the sketch below)"""
        return self.compute_statistics(self.results)
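One reasonable compute_statistics, sketched with scipy: because both variants score the same queries, a paired t-test is the natural choice:

from scipy import stats

def compute_statistics(self, results):
    """Paired t-test per metric: did variant B move the metric significantly?"""
    summary = {}
    for metric, pairs in results.items():
        a = [p['variant_a'] for p in pairs]
        b = [p['variant_b'] for p in pairs]
        t_stat, p_value = stats.ttest_rel(a, b)  # paired: same queries for A and B
        summary[metric] = {
            'mean_a': sum(a) / len(a),
            'mean_b': sum(b) / len(b),
            'p_value': float(p_value),
        }
    return summary

With few queries or non-normal score distributions, a Wilcoxon signed-rank test (scipy.stats.wilcoxon) is a safer alternative.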
Security and Privacy Features
1. Data Access Control
from typing import Dict

class SecureRAG:
    def __init__(self, base_rag, access_control):
        self.base_rag = base_rag
        self.access_control = access_control

    def retrieve(self, query: str, user_context: Dict):
        # Resolve which documents this user is allowed to see
        allowed_docs = self.access_control.filter_documents(user_context)
        # Perform retrieval on the allowed documents only
        return self.base_rag.retrieve(query, document_filter=allowed_docs)
2. Privacy-Preserving RAG
from typing import List

class PrivacyPreservingRAG:
    def __init__(self, anonymizer, base_rag):
        self.anonymizer = anonymizer
        self.base_rag = base_rag

    def process_document(self, document: str):
        # Mask sensitive information before indexing
        anonymized_doc = self.anonymizer.anonymize(document)
        return self.base_rag.process_document(anonymized_doc)

    def generate_response(self, query: str, contexts: List[str]):
        # Generate over anonymized contexts, then restore masked entities
        anonymized_response = self.base_rag.generate_response(query, contexts)
        return self.anonymizer.deanonymize(anonymized_response)
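A minimal, purely illustrative anonymizer using regular expressions; a production system would use a dedicated PII detector (e.g. an NER model) rather than patterns like these:

import re

class RegexAnonymizer:
    """Toy anonymizer: masks emails and phone-like numbers, remembers originals."""

    def __init__(self):
        self.vault = {}  # placeholder -> original value
        self.patterns = [
            (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "EMAIL"),
            (re.compile(r"\+?\d[\d\s()-]{7,}\d"), "PHONE"),
        ]

    def anonymize(self, text: str) -> str:
        for pattern, label in self.patterns:
            for i, match in enumerate(pattern.findall(text)):
                placeholder = f"<{label}_{i}>"
                self.vault[placeholder] = match
                text = text.replace(match, placeholder)
        return text

    def deanonymize(self, text: str) -> str:
        for placeholder, original in self.vault.items():
            text = text.replace(placeholder, original)
        return text

Note that naive de-anonymization can undo the privacy gain; real deployments restore entities only when the requesting user is authorized to see them.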