RAG & Knowledge Systems

Retrieval-Augmented Generation (RAG) lets AI models work with your private data — codebases, documents, wikis — without fine-tuning.

How RAG Works

User Query
    │
    ▼
┌───────────┐     ┌───────────────┐
│ Embedding │────▶│ Vector Search │
│   Model   │     │    (top-k)    │
└───────────┘     └───────┬───────┘
                          │
                  Retrieved Chunks
                          │
                          ▼
                   ┌──────────────┐
                   │   LLM with   │
                   │   Context    │──▶ Answer
                   └──────────────┘

  1. Index — split documents into chunks, generate embeddings, store in a vector database
  2. Retrieve — embed the user's query, find the most similar chunks
  3. Generate — pass retrieved chunks as context to the LLM
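The three stages can be sketched end to end in a few lines. This is a toy illustration: the bag-of-words `embed` here stands in for a real embedding model, and the sorted scan stands in for a vector database.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words vector; a real system would call an embedding model
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

# 1. Index: chunk documents and embed each chunk
chunks = [
    "Deploy Django with gunicorn behind nginx.",
    "Sourdough needs a long, slow rise.",
    "Set DEBUG = False in Django production settings.",
]
index = [(embed(c), c) for c in chunks]

# 2. Retrieve: embed the query, take the top-k most similar chunks
query = "how do I deploy Django"
top_k = sorted(index, key=lambda item: cosine(embed(query), item[0]),
               reverse=True)[:2]

# 3. Generate: pass the retrieved chunks to the LLM as context
prompt = ("Answer using this context:\n"
          + "\n".join(c for _, c in top_k)
          + f"\n\nQ: {query}")
```

The query matches both Django chunks and ignores the sourdough one, so only relevant context reaches the model.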

Chunking Strategies

How you split documents dramatically affects quality:

# Fixed-size chunks (simple but splits mid-sentence)
def fixed_chunks(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks = []
    for i in range(0, len(text), size - overlap):
        chunks.append(text[i:i + size])
    return chunks

import re

# Semantic chunking (split on headings/paragraphs)
def semantic_chunks(markdown: str) -> list[str]:
    # Split before each heading line so every section keeps its heading
    sections = re.split(r'\n(?=#{1,6} )', markdown)
    return [s.strip() for s in sections if s.strip()]

Rules of thumb:

  • Chunk size: 200-500 tokens works well for most use cases
  • Overlap: 10-20% prevents losing context at boundaries
  • Semantic splitting (by heading, paragraph) outperforms fixed-size
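A middle ground between the two strategies above is to split on sentence boundaries and greedily pack sentences into chunks, so no chunk ever breaks mid-sentence. A sketch, using character counts as a stand-in for token counts:

```python
import re

def sentence_chunks(text: str, max_chars: int = 200) -> list[str]:
    # Split on sentence-ending punctuation, then pack whole sentences
    # into chunks no longer than max_chars
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

A single sentence longer than `max_chars` still becomes its own chunk, which is usually the right trade-off: truncating mid-sentence loses more meaning than one oversized chunk.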

Vector Databases

Database   Best For                  Notes
ChromaDB   Local dev, prototyping    Embedded, no server needed
Pinecone   Production, managed       Serverless option available
pgvector   Already using Postgres    Extension, no extra infra
Qdrant     Self-hosted production    Rich filtering support
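What all of these stores provide can be illustrated with a minimal in-memory sketch: cosine-similarity search over stored vectors, with optional metadata filtering. Real databases add persistence, approximate-nearest-neighbor indexes, and scale; this is purely illustrative.

```python
import math

class TinyVectorStore:
    """Illustrative stand-in for a vector database; not production code."""

    def __init__(self):
        self.items = []  # (vector, text, metadata) triples

    def add(self, vector, text, metadata=None):
        self.items.append((vector, text, metadata or {}))

    def query(self, vector, k=3, where=None):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb) if na and nb else 0.0

        # Apply the metadata filter first, then rank by similarity
        matches = [
            (cosine(vector, v), text)
            for v, text, meta in self.items
            if where is None or all(meta.get(key) == val
                                    for key, val in where.items())
        ]
        matches.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in matches[:k]]
```

Usage mirrors the real clients: `store.add([1.0, 0.0], "deploy guide", {"source": "wiki"})`, then `store.query([1.0, 0.1], k=2, where={"source": "wiki"})`.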

Embedding Models

# Anthropic doesn't ship a first-party embedding model; it recommends
# Voyage AI's models. OpenAI's text-embedding-3-small is a common
# cost-efficient alternative.

import voyageai

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
result = vo.embed(["How do I deploy Django?"], model="voyage-3")
embeddings = result.embeddings

Improving RAG Quality

  1. Hybrid search — combine vector similarity with keyword (BM25) search
  2. Re-ranking — use a cross-encoder to re-score retrieved chunks
  3. Metadata filtering — filter by date, category, or source before vector search
  4. Query expansion — rewrite the user's query to capture more relevant results

# Simple re-ranking with an LLM (llm.score_relevance is a placeholder
# for whatever relevance-scoring call your stack provides)
def rerank(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    scored = []
    for chunk in chunks:
        score = llm.score_relevance(query, chunk)  # 0-1
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
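
For item 1 (hybrid search), a common way to merge the keyword and vector result lists is Reciprocal Rank Fusion. A sketch with hypothetical rankings:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: each list contributes 1/(k + rank) per
    # document; k=60 is the constant used in the original RRF paper
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc: scores[doc], reverse=True)

bm25_hits = ["chunk-a", "chunk-b", "chunk-c"]    # keyword ranking (hypothetical)
vector_hits = ["chunk-b", "chunk-c", "chunk-a"]  # vector ranking (hypothetical)
fused = rrf([bm25_hits, vector_hits])
```

RRF needs only rank positions, not raw scores, which sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.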

When RAG Isn't Enough

  • Structured data — use SQL or an API, not RAG
  • Real-time data — RAG indexes are stale; use live API calls
  • Complex reasoning — RAG retrieves facts but doesn't synthesize well across many documents
  • Domain-specific reasoning — consider fine-tuning when RAG can't capture specialized reasoning patterns