Interactive Demo
RAG Pipeline
Retrieval-Augmented Generation grounds large language models in your own data. Watch each stage of the pipeline come alive, from raw documents to a cited, accurate answer.
Scenario: querying a company handbook — "What's our remote work policy?"
How It Works
Chunking Strategies
Documents are split into overlapping segments so that no context is lost at boundaries.
Common approaches include fixed-size windows (e.g. 512 tokens with 64-token overlap),
recursive splitting by paragraph and sentence boundaries,
and semantic chunking that detects topic shifts via embedding distance.
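The fixed-size window strategy is easy to sketch in plain Python. This is a minimal illustration, not a production splitter: `window_chunks` is a made-up helper, and `tokens` stands in for real tokenizer output (token IDs in practice).

```python
def window_chunks(tokens, size=512, overlap=64):
    """Split a token sequence into fixed-size windows that overlap,
    so text near a boundary appears in two adjacent chunks."""
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]
```

With size=512 and overlap=64, each chunk's first 64 tokens repeat the previous chunk's last 64, which is exactly the boundary context that naive splitting would lose.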
Embedding Models
Each chunk is converted into a dense vector (e.g. 1536 dimensions for text-embedding-3-small).
These vectors capture semantic meaning so that "remote work" and "work from home" land close together
in the vector space, even though the words differ.
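"Lands close together" is measured with cosine similarity between the vectors. A toy sketch follows; the 4-dimensional vectors are invented for illustration, whereas real embedding models emit hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Hypothetical 4-d "embeddings" -- real models emit e.g. 1536 dimensions
remote_work    = [0.90, 0.10, 0.80, 0.20]
work_from_home = [0.85, 0.15, 0.75, 0.25]
vacation_days  = [0.10, 0.90, 0.20, 0.80]
```

Here the two remote-work phrases score far higher with each other than either does with the vacation vector, which is the property retrieval relies on.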
Similarity Search
At query time the user's question is embedded with the same model and compared against every stored vector.
Cosine similarity or dot product finds the top-k nearest chunks.
Approximate-nearest-neighbor indices (HNSW, IVF) make this fast even at millions of vectors.
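At small scale the search is just a linear scan. A brute-force sketch, with a hypothetical `top_k` helper and toy store; an ANN index such as HNSW replaces this loop once the corpus grows:

```python
import math

def top_k(query_vec, store, k=3):
    """Brute-force nearest-neighbor search by cosine similarity.
    store: list of (chunk_id, vector) pairs."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(x * x for x in b)))
    scored = [(chunk_id, cos(query_vec, vec)) for chunk_id, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

The linear scan is O(n) per query; HNSW and IVF trade a little recall for roughly logarithmic query time.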
Prompt Augmentation
Retrieved chunks are injected into the LLM prompt as context:
"Answer using ONLY the context below."
This grounds the model in factual data, dramatically reducing hallucinations
and letting it cite specific passages from your documents.
Deep Dive
Naive RAG
The simplest pipeline: index documents, retrieve top-k chunks at query time, and concatenate them into a single LLM prompt. Fast to build but sensitive to chunk boundaries and retrieval noise.
- Single retrieval step with no re-ranking
- Flat document store (no metadata filtering)
- Best for small corpora with well-structured documents
Advanced RAG
Adds pre-retrieval and post-retrieval optimizations: query rewriting, hybrid search (keyword + vector), re-ranking with a cross-encoder, and context compression to fit within the LLM's window.
- Query expansion via HyDE (Hypothetical Document Embeddings)
- Cross-encoder re-ranking (e.g. ms-marco-MiniLM) to refine top-k
- Metadata filters (date, source, section) narrow the search space
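One common way to fuse keyword and vector rankings in hybrid search is reciprocal rank fusion (RRF), which needs only the two ranked ID lists. A sketch, with invented chunk IDs; the k=60 constant is a widely used default, not a tuned value:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list.
    Each appearance contributes 1/(k + rank) to the chunk's score,
    so items ranked highly in multiple lists float to the top."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["c3", "c1", "c7"]   # e.g. BM25 order
vector_hits  = ["c1", "c3", "c9"]   # e.g. embedding order
```

Chunks found by both retrievers (c1, c3) outrank chunks found by only one, without needing to reconcile the incomparable BM25 and cosine scores.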
Modular RAG
A composable architecture where retrieval, generation, and validation are independent modules. Enables routing (choose retriever per query type), iterative retrieval (multi-hop reasoning), and self-reflective generation with citation verification.
- Router selects retriever (vector, graph, SQL) based on query classification
- Iterative retrieval: LLM identifies missing info and triggers additional lookups
- Critic module validates generated answers against source chunks
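The routing step can be as simple as a keyword classifier, though production routers typically use an LLM or a trained model for query classification. A toy sketch; the rules and retriever labels are invented for illustration:

```python
def route_query(query):
    """Toy router: pick a retriever by surface features of the query.
    Aggregation questions go to SQL, relationship questions to a
    graph store, everything else to vector search."""
    q = query.lower()
    if any(w in q for w in ("how many", "count", "average", "total")):
        return "sql"
    if any(w in q for w in ("related", "connected", "depend")):
        return "graph"
    return "vector"
```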
A minimal Python RAG pipeline using LangChain-style patterns:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI

# 1. Chunk the document
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " "]
)
chunks = splitter.split_text(document_text)

# 2. Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_texts(chunks, embeddings)

# 3. Retrieve relevant chunks
query = "What is the remote work policy?"
results = vectorstore.similarity_search_with_score(
    query, k=3
)

# 4. Build augmented prompt
context = "\n---\n".join([doc.page_content for doc, _ in results])
prompt = f"""Answer using ONLY the context below.

Context:
{context}

Question: {query}
Answer:"""

# 5. Generate with LLM
llm = ChatOpenAI()
response = llm.invoke(prompt)
Key parameters to tune: chunk_size (larger = more context per chunk, fewer chunks), chunk_overlap (prevents losing context at boundaries), and k (number of retrieved chunks).
Hallucination Despite Context
Even with retrieved chunks, LLMs can hallucinate when the context is ambiguous or when the model over-generalizes. Mitigation strategies include explicit "cite your source" instructions, chunk-level attribution, and confidence scoring.
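A crude chunk-level attribution check can flag answer sentences that lack lexical support in the retrieved context. This is only a heuristic sketch (`grounding_score` is a hypothetical helper; stronger systems use NLI models or LLM judges rather than word overlap):

```python
def grounding_score(answer_sentence, chunks):
    """Fraction of the sentence's content words (length > 3) that
    appear somewhere in the retrieved chunks -- a rough proxy for
    'is this claim supported by the context?'"""
    words = {w.strip(".,").lower()
             for w in answer_sentence.split() if len(w) > 3}
    if not words:
        return 1.0
    supported = {w for w in words
                 if any(w in chunk.lower() for chunk in chunks)}
    return len(supported) / len(words)
```

Sentences scoring near zero are candidates for removal or a "not found in the provided documents" fallback.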
Stale Data
If the vector store is not kept in sync with source documents, the LLM will answer based on outdated information. Solutions include incremental indexing, TTL-based cache invalidation, and versioned document stores that track when each chunk was last updated.
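Incremental indexing largely reduces to change detection: hash each chunk and re-embed only what changed. A minimal sketch, where the `index` dict stands in for whatever metadata store the vector database provides:

```python
import hashlib

def changed_chunks(chunks, index):
    """Return chunks whose content hash differs from the stored one,
    so only new or edited chunks are re-embedded. Mutates index."""
    to_embed = []
    for chunk_id, text in chunks:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if index.get(chunk_id) != digest:
            to_embed.append((chunk_id, text))
            index[chunk_id] = digest
    return to_embed
```

On the first run everything is embedded; afterwards, a sync job only pays embedding cost for edited passages.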
Context Window Limits
Retrieving too many chunks can exceed the LLM's context window or degrade response quality as the model struggles with "lost in the middle" effects. Best practices:
- Use re-ranking to promote the most relevant chunks to the top
- Apply context compression (summarize less-relevant chunks)
- Consider map-reduce: summarize each chunk independently, then synthesize
- For very large retrieval sets, use iterative refinement instead of single-shot
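A simple guard against overflowing the window is a greedy token budget applied to the re-ranked chunks. The sketch below approximates token counts with whitespace-separated words; a real pipeline would use the model's own tokenizer (e.g. tiktoken) and a budget derived from the actual context limit:

```python
def fit_to_budget(ranked_chunks, max_tokens=3000):
    """Greedily keep the highest-ranked chunks that fit the budget.
    ranked_chunks must already be sorted best-first."""
    kept, used = [], 0
    for chunk in ranked_chunks:
        cost = len(chunk.split())  # crude stand-in for a tokenizer
        if used + cost > max_tokens:
            break  # use 'continue' instead to try smaller later chunks
        kept.append(chunk)
        used += cost
    return kept
```

Because the list is best-first, anything dropped is by construction the least relevant material, which also mitigates the lost-in-the-middle effect.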