Advanced · 180 min read

RAG (Retrieval-Augmented Generation) Systems

Build RAG systems that combine LLMs with external knowledge bases. Learn vector databases, embeddings, and advanced retrieval techniques.

Topics Covered:

  • RAG Architecture
  • Vector Databases
  • Embeddings
  • Retrieval Strategies
  • Evaluation

Prerequisites:

  • LangChain experience
  • Understanding of embeddings
  • Database knowledge

Overview

RAG (Retrieval-Augmented Generation) systems combine the power of LLMs with external knowledge bases, enabling AI applications to answer questions using specific documents or data. This advanced tutorial covers building production-ready RAG systems.

Understanding RAG Architecture

RAG systems solve the problem of LLM knowledge limitations by retrieving relevant information before generating responses.

RAG Components:

1. Document Loader: Loads and processes documents
2. Text Splitter: Chunks documents into manageable pieces
3. Embeddings: Converts text to vector representations
4. Vector Store: Stores and searches embeddings
5. Retriever: Finds relevant documents for queries
6. LLM: Generates responses using retrieved context

RAG Flow:

1. User asks question
2. System converts question to embedding
3. Searches vector store for similar documents
4. Retrieves top-k relevant chunks
5. Passes chunks + question to LLM
6. LLM generates answer using context

Benefits:

  • Answers questions about specific documents
  • Can cite sources
  • More accurate than LLM alone
  • Can be updated with new information
  • Reduces hallucinations
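The six steps of the flow map almost directly onto code. Here is a minimal sketch, assuming a vectorstore and llm are already configured; the helper name answer_with_rag is hypothetical, not a LangChain API:

def answer_with_rag(question: str, vectorstore, llm, k: int = 4) -> str:
    # Steps 2-4: retrieve the top-k chunks most similar to the question
    # (the vector store embeds the query internally)
    docs = vectorstore.similarity_search(question, k=k)
    # Step 5: combine the retrieved chunks and the question in one prompt
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    # Step 6: the LLM generates an answer grounded in the retrieved context
    return llm.predict(prompt)

Each section of this tutorial fills in one of these pieces with real components.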

Working with Embeddings

Embeddings convert text into numerical vectors that capture semantic meaning.

Key Concepts:

  • Embeddings are dense vectors (arrays of numbers)
  • Similar texts have similar embeddings
  • Smaller distance between vectors means greater semantic similarity
  • Common dimensions: 1536 (OpenAI), 768, 1024

Embedding Models:

  • OpenAI text-embedding-ada-002 (most common)
  • Sentence Transformers (open source)
  • Cohere embeddings
  • Custom fine-tuned models

Best Practices:

  • Use the same model for indexing and querying
  • Normalize embeddings for better similarity comparisons
  • Consider embedding dimensions vs. accuracy trade-offs
  • Batch embeddings for efficiency

Code Example:
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize embeddings
embeddings = OpenAIEmbeddings()

# Create text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)

# Load and split documents
documents = text_splitter.split_documents(loaded_docs)

# Generate embeddings
# This happens automatically when adding to vector store
# But you can also do it manually:

text = "This is a sample document about AI Engineering."
embedding = embeddings.embed_query(text)
print(f"Embedding dimension: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")

# Batch embeddings (more efficient)
texts = ["Document 1", "Document 2", "Document 3"]
batch_embeddings = embeddings.embed_documents(texts)

# Compare similarity
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

query = "What is AI Engineering?"
query_embedding = embeddings.embed_query(query)

# Compare with document embeddings
similarities = []
for doc_embedding in batch_embeddings:
    similarity = cosine_similarity(
        [query_embedding],
        [doc_embedding]
    )[0][0]
    similarities.append(similarity)

# Find most similar
most_similar_idx = np.argmax(similarities)
print(f"Most similar document: {texts[most_similar_idx]}")

Embeddings convert text to vectors. Similar texts have similar vectors. Use cosine similarity to find relevant documents for queries.
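To see what cosine similarity actually computes, here is the same comparison done by hand with NumPy. This is a toy sketch: the vectors below are made-up three-dimensional values, not real embeddings.

import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity = dot product divided by the product of L2 norms;
    # 1.0 means the vectors point in the same direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" for illustration only
query_vec = np.array([0.2, 0.1, 0.7])
doc_vec = np.array([0.25, 0.05, 0.65])
print(f"Cosine similarity: {cosine_sim(query_vec, doc_vec):.4f}")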

Vector Databases and Storage

Vector databases efficiently store and search high-dimensional vectors.

Popular Options:

  • Pinecone: Managed, scalable, production-ready
  • Weaviate: Open source, self-hosted
  • Chroma: Lightweight, easy to use
  • FAISS: Facebook's library, in-memory
  • Qdrant: Open source, performant
  • Milvus: Enterprise-grade, scalable

Choosing a Vector DB:

  • Scale requirements
  • Latency needs
  • Budget (managed vs. self-hosted)
  • Feature requirements (metadata filtering, etc.)
  • Team expertise

Code Example:
# Using Pinecone (managed)
from langchain.vectorstores import Pinecone
import pinecone

# Initialize Pinecone
pinecone.init(
    api_key="your-api-key",
    environment="us-west1-gcp"
)

# Create index
index_name = "ai-engineering-docs"
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=1536,  # OpenAI embedding dimension
        metric="cosine"
    )

# Create vector store
vectorstore = Pinecone.from_documents(
    documents,
    embeddings,
    index_name=index_name
)

# Using Chroma (local, easy)
from langchain.vectorstores import Chroma

vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Using FAISS (in-memory)
from langchain.vectorstores import FAISS

vectorstore = FAISS.from_documents(
    documents=documents,
    embedding=embeddings
)

# Save and load
vectorstore.save_local("./faiss_index")
loaded_vectorstore = FAISS.load_local(
    "./faiss_index",
    embeddings
)

Vector databases store embeddings and enable fast similarity search. Choose based on your needs: Pinecone for production scale, Chroma for simplicity, FAISS for in-memory.
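Whichever store you pick, the LangChain interface is the same, so you can query it directly without a full QA chain. A quick sketch against the Chroma store created above (the query text is illustrative):

# Direct similarity search (no LLM involved)
query = "What is retrieval-augmented generation?"
results = vectorstore.similarity_search(query, k=3)
for doc in results:
    print(doc.page_content[:100])

# Chroma and FAISS can also return distance scores, which is useful
# for applying a relevance cutoff before generation
scored = vectorstore.similarity_search_with_score(query, k=3)
for doc, score in scored:
    print(f"score={score:.4f}  {doc.page_content[:60]}")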

Building a Complete RAG System

Putting it all together: a complete RAG system with LangChain.

Components:

  • Document loading
  • Text splitting
  • Embedding generation
  • Vector storage
  • Retrieval
  • Generation

Code Example:
from langchain.document_loaders import PyPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# 1. Load documents
loader = PyPDFLoader("ai_engineering_guide.pdf")
documents = loader.load()

# 2. Split documents
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
texts = text_splitter.split_documents(documents)

# 3. Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents=texts,
    embedding=embeddings,
    persist_directory="./rag_db"
)

# 4. Create retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}  # Retrieve top 4 documents
)

# 5. Create QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)

# 6. Query the system
query = "What are the key skills for AI Engineers?"
result = qa_chain({"query": query})

print(f"Answer: {result['result']}")
print(f"Sources: {len(result['source_documents'])} documents")

# Advanced: Using chat models with memory
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

conversational_chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(temperature=0),
    retriever=retriever,
    memory=memory
)

# Multi-turn conversation
result1 = conversational_chain({"question": "What is AI Engineering?"})
result2 = conversational_chain({"question": "What skills do I need?"})  # Can reference previous context

This is a complete RAG system. Documents are loaded, split, embedded, stored, and retrieved. The QA chain combines retrieval with generation for accurate answers.
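A refinement worth knowing: RetrievalQA accepts a custom prompt through chain_type_kwargs, which lets you control tone and tell the model to admit when the context has no answer. A minimal sketch, with illustrative prompt wording:

from langchain.prompts import PromptTemplate

# {context} and {question} are the variables the "stuff" chain fills in
template = """Use the following context to answer the question.
If the answer is not in the context, say "I don't know" rather than guessing.

Context: {context}

Question: {question}
Answer:"""

qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": PromptTemplate.from_template(template)},
    return_source_documents=True
)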

Advanced Retrieval Strategies

Improving retrieval quality is crucial for RAG performance.

Techniques:

  • Hybrid search (keyword + semantic)
  • Re-ranking results
  • Metadata filtering
  • Query expansion
  • Multi-query retrieval

Evaluation (a hit-rate sketch follows the code example below):

  • Measure retrieval accuracy
  • Test with diverse queries
  • Monitor user feedback
  • A/B test different strategies

Code Example:
# Hybrid search (combining keyword and semantic)
from langchain.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Semantic retriever
vectorstore_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Keyword retriever (BM25)
bm25_retriever = BM25Retriever.from_documents(texts)
bm25_retriever.k = 10

# Combine both
ensemble_retriever = EnsembleRetriever(
    retrievers=[vectorstore_retriever, bm25_retriever],
    weights=[0.5, 0.5]  # Equal weight
)

# Re-ranking results
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Compress and re-rank with an LLM-based extractor
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore_retriever
)

# Metadata filtering (assumes the documents were loaded with metadata
# fields such as "category" and "level")
vectorstore_with_metadata = Chroma.from_documents(
    documents,
    embeddings,
    persist_directory="./rag_db"
)

# Filter by metadata
filtered_retriever = vectorstore_with_metadata.as_retriever(
    search_kwargs={
        "k": 4,
        "filter": {"category": "tutorial", "level": "beginner"}
    }
)

# Multi-query retrieval
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore_retriever,
    llm=llm
)

# Generates multiple query variations and retrieves for each

Advanced retrieval strategies improve RAG performance: hybrid search combines keyword and semantic matching, re-ranking improves relevance, and metadata filtering enables precise retrieval.
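To measure retrieval accuracy, as suggested above, you can score hit rate over a small hand-labeled set of queries. A minimal sketch, where the evaluation pairs and the hit_rate_at_k helper are illustrative, not part of LangChain:

# Each query maps to the source file whose chunks should be retrieved
eval_set = [
    ("What are the key skills for AI Engineers?", "ai_engineering_guide.pdf"),
    ("How are documents chunked?", "ai_engineering_guide.pdf"),
]

def hit_rate_at_k(retriever, eval_set, k: int = 4) -> float:
    # Fraction of queries where a chunk from the expected source
    # appears in the top-k retrieved documents
    hits = 0
    for query, expected_source in eval_set:
        docs = retriever.get_relevant_documents(query)[:k]
        if any(d.metadata.get("source") == expected_source for d in docs):
            hits += 1
    return hits / len(eval_set)

print(f"Hit rate @ 4: {hit_rate_at_k(ensemble_retriever, eval_set):.2%}")

Run this for each retriever variant (plain, ensemble, re-ranked) to A/B test strategies.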

Conclusion

RAG systems enable AI applications to answer questions using specific knowledge bases. Master embeddings, vector databases, and retrieval strategies to build production-ready RAG systems. Remember: good retrieval is the foundation of good RAG - invest time in optimizing your retrieval pipeline.