Interactive Demo
RAG Pipeline
Retrieval-Augmented Generation grounds large language models in your own data. Watch each stage of the pipeline come alive, from raw documents to a cited, accurate answer.
Scenario: querying a company handbook — "What's our remote work policy?"
How It Works
Chunking Strategies
Documents are split into overlapping segments so that no context is lost at boundaries.
Common approaches include fixed-size windows (e.g. 512 tokens with 64-token overlap),
recursive splitting by paragraph and sentence boundaries,
and semantic chunking that detects topic shifts via embedding distance.
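The fixed-size window strategy is easy to sketch in plain Python. This is a minimal illustration, not a production splitter: `window_chunks` is a made-up helper, and `tokens` stands in for real tokenizer output (token IDs in practice).

```python
def window_chunks(tokens, size=512, overlap=64):
    """Split a token sequence into fixed-size windows that overlap,
    so text near a boundary appears in two adjacent chunks."""
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]
```

With size=512 and overlap=64, each chunk's first 64 tokens repeat the previous chunk's last 64, which is exactly the boundary context that naive splitting would lose.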
Embedding Models
Each chunk is converted into a dense vector (e.g. 1536 dimensions for text-embedding-3-small).
These vectors capture semantic meaning so that "remote work" and "work from home" land close together
in the vector space, even though the words differ.
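"Lands close together" is measured with cosine similarity between the vectors. A toy sketch follows; the 4-dimensional vectors are invented for illustration, whereas real embedding models emit hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Hypothetical 4-d "embeddings" -- real models emit e.g. 1536 dimensions
remote_work    = [0.90, 0.10, 0.80, 0.20]
work_from_home = [0.85, 0.15, 0.75, 0.25]
vacation_days  = [0.10, 0.90, 0.20, 0.80]
```

Here the two remote-work phrases score far higher with each other than either does with the vacation vector, which is the property retrieval relies on.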
Similarity Search
At query time the user's question is embedded with the same model and compared against every stored vector.
Cosine similarity or dot product finds the top-k nearest chunks.
Approximate-nearest-neighbor indices (HNSW, IVF) make this fast even at millions of vectors.
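At small scale the search is just a linear scan. A brute-force sketch, with a hypothetical `top_k` helper and toy store; an ANN index such as HNSW replaces this loop once the corpus grows:

```python
import math

def top_k(query_vec, store, k=3):
    """Brute-force nearest-neighbor search by cosine similarity.
    store: list of (chunk_id, vector) pairs."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(x * x for x in b)))
    scored = [(chunk_id, cos(query_vec, vec)) for chunk_id, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

The linear scan is O(n) per query; HNSW and IVF trade a little recall for roughly logarithmic query time.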
Prompt Augmentation
Retrieved chunks are injected into the LLM prompt as context:
"Answer using ONLY the context below."
This grounds the model in factual data, dramatically reducing hallucinations
and letting it cite specific passages from your documents.
Deep Dive
Naive RAG
The simplest pipeline: index documents, retrieve top-k chunks at query time, and concatenate them into a single LLM prompt. Fast to build but sensitive to chunk boundaries and retrieval noise.
- Single retrieval step with no re-ranking
- Flat document store (no metadata filtering)
- Best for small corpora with well-structured documents
Advanced RAG
Adds pre-retrieval and post-retrieval optimizations: query rewriting, hybrid search (keyword + vector), re-ranking with a cross-encoder, and context compression to fit within the LLM's window.
- Query expansion via HyDE (Hypothetical Document Embeddings)
- Cross-encoder re-ranking (e.g. ms-marco-MiniLM) to refine top-k
- Metadata filters (date, source, section) narrow the search space
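One common way to fuse keyword and vector rankings in hybrid search is reciprocal rank fusion (RRF), which needs only the two ranked ID lists. A sketch, with invented chunk IDs; the k=60 constant is a widely used default, not a tuned value:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list.
    Each appearance contributes 1/(k + rank) to the chunk's score,
    so items ranked highly in multiple lists float to the top."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["c3", "c1", "c7"]   # e.g. BM25 order
vector_hits  = ["c1", "c3", "c9"]   # e.g. embedding order
```

Chunks found by both retrievers (c1, c3) outrank chunks found by only one, without needing to reconcile the incomparable BM25 and cosine scores.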
Modular RAG
A composable architecture where retrieval, generation, and validation are independent modules. Enables routing (choose retriever per query type), iterative retrieval (multi-hop reasoning), and self-reflective generation with citation verification.
- Router selects retriever (vector, graph, SQL) based on query classification
- Iterative retrieval: LLM identifies missing info and triggers additional lookups
- Critic module validates generated answers against source chunks
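The routing step can be as simple as a keyword classifier, though production routers typically use an LLM or a trained model for query classification. A toy sketch; the rules and retriever labels are invented for illustration:

```python
def route_query(query):
    """Toy router: pick a retriever by surface features of the query.
    Aggregation questions go to SQL, relationship questions to a
    graph store, everything else to vector search."""
    q = query.lower()
    if any(w in q for w in ("how many", "count", "average", "total")):
        return "sql"
    if any(w in q for w in ("related", "connected", "depend")):
        return "graph"
    return "vector"
```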
A minimal Python RAG pipeline using LangChain-style patterns:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI

# 1. Chunk the document
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_text(document_text)

# 2. Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_texts(chunks, embeddings)

# 3. Retrieve relevant chunks
query = "What is the remote work policy?"
results = vectorstore.similarity_search_with_score(
    query, k=3
)

# 4. Build augmented prompt
context = "\n---\n".join([doc.page_content for doc, _ in results])
prompt = f"""Answer using ONLY the context below.

Context:
{context}

Question: {query}
Answer:"""

# 5. Generate with LLM
llm = ChatOpenAI()
response = llm.invoke(prompt)
Key parameters to tune: chunk_size (larger = more context per chunk, fewer chunks), chunk_overlap (prevents losing context at boundaries), and k (number of retrieved chunks).
Hallucination Despite Context
Even with retrieved chunks, LLMs can hallucinate when the context is ambiguous or when the model over-generalizes. Mitigation strategies include explicit "cite your source" instructions, chunk-level attribution, and confidence scoring.
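A crude chunk-level attribution check can flag answer sentences that lack lexical support in the retrieved context. This is only a heuristic sketch (`grounding_score` is a hypothetical helper; stronger systems use NLI models or LLM judges rather than word overlap):

```python
def grounding_score(answer_sentence, chunks):
    """Fraction of the sentence's content words (length > 3) that
    appear somewhere in the retrieved chunks -- a rough proxy for
    'is this claim supported by the context?'"""
    words = {w.strip(".,").lower()
             for w in answer_sentence.split() if len(w) > 3}
    if not words:
        return 1.0
    supported = {w for w in words
                 if any(w in chunk.lower() for chunk in chunks)}
    return len(supported) / len(words)
```

Sentences scoring near zero are candidates for removal or a "not found in the provided documents" fallback.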
Stale Data
If the vector store is not kept in sync with source documents, the LLM will answer based on outdated information. Solutions include incremental indexing, TTL-based cache invalidation, and versioned document stores that track when each chunk was last updated.
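Incremental indexing largely reduces to change detection: hash each chunk and re-embed only what changed. A minimal sketch, where the `index` dict stands in for whatever metadata store the vector database provides:

```python
import hashlib

def changed_chunks(chunks, index):
    """Return chunks whose content hash differs from the stored one,
    so only new or edited chunks are re-embedded. Mutates index."""
    to_embed = []
    for chunk_id, text in chunks:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if index.get(chunk_id) != digest:
            to_embed.append((chunk_id, text))
            index[chunk_id] = digest
    return to_embed
```

On the first run everything is embedded; afterwards, a sync job only pays embedding cost for edited passages.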
Context Window Limits
Retrieving too many chunks can exceed the LLM's context window or degrade response quality as the model struggles with "lost in the middle" effects. Best practices:
- Use re-ranking to promote the most relevant chunks to the top
- Apply context compression (summarize less-relevant chunks)
- Consider map-reduce: summarize each chunk independently, then synthesize
- For very large retrieval sets, use iterative refinement instead of single-shot
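A simple guard against overflowing the window is a greedy token budget applied to the re-ranked chunks. The sketch below approximates token counts with whitespace-separated words; a real pipeline would use the model's own tokenizer (e.g. tiktoken) and a budget derived from the actual context limit:

```python
def fit_to_budget(ranked_chunks, max_tokens=3000):
    """Greedily keep the highest-ranked chunks that fit the budget.
    ranked_chunks must already be sorted best-first."""
    kept, used = [], 0
    for chunk in ranked_chunks:
        cost = len(chunk.split())  # crude stand-in for a tokenizer
        if used + cost > max_tokens:
            break  # use 'continue' instead to try smaller later chunks
        kept.append(chunk)
        used += cost
    return kept
```

Because the list is best-first, anything dropped is by construction the least relevant material, which also mitigates the lost-in-the-middle effect.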