RAG & Knowledge Systems

Retrieval-Augmented Generation (RAG) lets AI models work with your private data — codebases, documents, wikis — without fine-tuning.

How RAG Works

User Query
    │
    ▼
┌───────────┐     ┌───────────────┐
│ Embedding │────▶│ Vector Search │
│   Model   │     │    (top-k)    │
└───────────┘     └───────┬───────┘
                          │
                  Retrieved Chunks
                          │
                          ▼
                   ┌──────────────┐
                   │   LLM with   │
                   │   Context    │──▶ Answer
                   └──────────────┘

  1. Index — split documents into chunks, generate embeddings, store in a vector database
  2. Retrieve — embed the user's query, find the most similar chunks
  3. Generate — pass retrieved chunks as context to the LLM
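The three stages can be sketched end to end in a few lines. This is a toy illustration: the bag-of-words `embed` here stands in for a real embedding model, and the sorted scan stands in for a vector database.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words vector; a real system would call an embedding model
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

# 1. Index: chunk documents and embed each chunk
chunks = [
    "Deploy Django with gunicorn behind nginx.",
    "Sourdough needs a long, slow rise.",
    "Set DEBUG = False in Django production settings.",
]
index = [(embed(c), c) for c in chunks]

# 2. Retrieve: embed the query, take the top-k most similar chunks
query = "how do I deploy Django"
top_k = sorted(index, key=lambda item: cosine(embed(query), item[0]),
               reverse=True)[:2]

# 3. Generate: pass the retrieved chunks to the LLM as context
prompt = ("Answer using this context:\n"
          + "\n".join(c for _, c in top_k)
          + f"\n\nQ: {query}")
```

The query matches both Django chunks and ignores the sourdough one, so only relevant context reaches the model.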

Chunking Strategies

How you split documents dramatically affects quality:

# Fixed-size chunks (simple but splits mid-sentence)
def fixed_chunks(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks = []
    for i in range(0, len(text), size - overlap):
        chunks.append(text[i:i + size])
    return chunks

import re

# Semantic chunking (split on headings/paragraphs)
def semantic_chunks(markdown: str) -> list[str]:
    # Split before each heading line so every section keeps its heading
    sections = re.split(r'\n(?=#{1,6} )', markdown)
    return [s.strip() for s in sections if s.strip()]

Rules of thumb:

  • Chunk size: 200-500 tokens works well for most use cases
  • Overlap: 10-20% prevents losing context at boundaries
  • Semantic splitting (by heading, paragraph) outperforms fixed-size
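A middle ground between the two strategies above is to split on sentence boundaries and greedily pack sentences into chunks, so no chunk ever breaks mid-sentence. A sketch, using character counts as a stand-in for token counts:

```python
import re

def sentence_chunks(text: str, max_chars: int = 200) -> list[str]:
    # Split on sentence-ending punctuation, then pack whole sentences
    # into chunks no longer than max_chars
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

A single sentence longer than `max_chars` still becomes its own chunk, which is usually the right trade-off: truncating mid-sentence loses more meaning than one oversized chunk.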

Vector Databases

Database   Best For                  Notes
ChromaDB   Local dev, prototyping    Embedded, no server needed
Pinecone   Production, managed       Serverless option available
pgvector   Already using Postgres    Extension, no extra infra
Qdrant     Self-hosted production    Rich filtering support
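What all of these stores provide can be illustrated with a minimal in-memory sketch: cosine-similarity search over stored vectors, with optional metadata filtering. Real databases add persistence, approximate-nearest-neighbor indexes, and scale; this is purely illustrative.

```python
import math

class TinyVectorStore:
    """Illustrative stand-in for a vector database; not production code."""

    def __init__(self):
        self.items = []  # (vector, text, metadata) triples

    def add(self, vector, text, metadata=None):
        self.items.append((vector, text, metadata or {}))

    def query(self, vector, k=3, where=None):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb) if na and nb else 0.0

        # Apply the metadata filter first, then rank by similarity
        matches = [
            (cosine(vector, v), text)
            for v, text, meta in self.items
            if where is None or all(meta.get(key) == val
                                    for key, val in where.items())
        ]
        matches.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in matches[:k]]
```

Usage mirrors the real clients: `store.add([1.0, 0.0], "deploy guide", {"source": "wiki"})`, then `store.query([1.0, 0.1], k=2, where={"source": "wiki"})`.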

Embedding Models

# Anthropic doesn't ship a first-party embedding model; it recommends
# Voyage AI's models. OpenAI's text-embedding-3-small is a common
# cost-efficient alternative.

import voyageai

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
result = vo.embed(["How do I deploy Django?"], model="voyage-3")
embeddings = result.embeddings

Improving RAG Quality

  1. Hybrid search — combine vector similarity with keyword (BM25) search
  2. Re-ranking — use a cross-encoder to re-score retrieved chunks
  3. Metadata filtering — filter by date, category, or source before vector search
  4. Query expansion — rewrite the user's query to capture more relevant results

# Simple re-ranking with an LLM (llm.score_relevance is a placeholder
# for whatever relevance-scoring call your stack provides)
def rerank(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    scored = []
    for chunk in chunks:
        score = llm.score_relevance(query, chunk)  # 0-1
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
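
For item 1 (hybrid search), a common way to merge the keyword and vector result lists is Reciprocal Rank Fusion. A sketch with hypothetical rankings:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: each list contributes 1/(k + rank) per
    # document; k=60 is the constant used in the original RRF paper
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc: scores[doc], reverse=True)

bm25_hits = ["chunk-a", "chunk-b", "chunk-c"]    # keyword ranking (hypothetical)
vector_hits = ["chunk-b", "chunk-c", "chunk-a"]  # vector ranking (hypothetical)
fused = rrf([bm25_hits, vector_hits])
```

RRF needs only rank positions, not raw scores, which sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.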

When RAG Isn't Enough

  • Structured data — use SQL or an API, not RAG
  • Real-time data — RAG indexes are stale; use live API calls
  • Complex reasoning — RAG retrieves facts but doesn't synthesize well across many documents
  • Domain-specific reasoning — consider fine-tuning when RAG can't capture specialized reasoning patterns