# RAG & Knowledge Systems
Retrieval-Augmented Generation (RAG) lets AI models work with your private data — codebases, documents, wikis — without fine-tuning.
## How RAG Works
```
User Query
     │
     ▼
┌───────────┐      ┌───────────────┐
│ Embedding │────▶ │ Vector Search │
│   Model   │      │    (top-k)    │
└───────────┘      └───────┬───────┘
                           │
                   Retrieved Chunks
                           │
                           ▼
                   ┌──────────────┐
                   │  LLM with    │
                   │  Context     │──▶ Answer
                   └──────────────┘
```
1. **Index** — split documents into chunks, generate embeddings, store in a vector database
2. **Retrieve** — embed the user's query, find the most similar chunks
3. **Generate** — pass retrieved chunks as context to the LLM
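The three steps can be sketched end to end. The bag-of-words `embed` below is a toy stand-in for a real embedding model, and all chunk texts are illustrative:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: bag-of-words term counts
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, index: list[tuple[Counter, str]], k: int = 2) -> list[str]:
    # Rank stored chunks by similarity to the embedded query, keep top-k
    ranked = sorted(index, key=lambda item: cosine(embed(query), item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

# Index: embed each chunk once and store the vector alongside the text
chunks = ["Deploy Django with gunicorn", "Postgres backup guide", "Django settings reference"]
index = [(embed(c), c) for c in chunks]

# Retrieve + generate: top-k chunks become the LLM's context
context = "\n".join(retrieve("how to deploy Django", index))
prompt = f"Context:\n{context}\n\nQuestion: how to deploy Django"
```

A real system swaps `embed` for an embedding API and the list for a vector database, but the control flow is the same.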
## Chunking Strategies
How you split documents dramatically affects quality:
```python
# Fixed-size chunks (simple, but can split mid-sentence)
def fixed_chunks(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    chunks = []
    for i in range(0, len(text), size - overlap):
        chunks.append(text[i:i + size])
    return chunks

# Semantic chunking (split on level-2 markdown headings)
def semantic_chunks(markdown: str) -> list[str]:
    sections = markdown.split('\n## ')
    # split() strips the '## ' marker, so re-attach it to every section but the first
    return ['## ' + s.strip() if i else s.strip()
            for i, s in enumerate(sections) if s.strip()]
```
Rules of thumb:

- Chunk size: 200-500 tokens works well for most use cases
- Overlap: 10-20% prevents losing context at boundaries
- Semantic splitting (by heading, paragraph) outperforms fixed-size
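The overlap rule is easy to see on a short string. The fixed-size chunker is repeated here so the snippet is self-contained:

```python
def fixed_chunks(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    chunks = []
    for i in range(0, len(text), size - overlap):
        chunks.append(text[i:i + size])
    return chunks

text = "abcdefghij" * 10                          # 100 characters
chunks = fixed_chunks(text, size=40, overlap=8)   # 20% overlap

# Step is size - overlap = 32, so chunks start at offsets 0, 32, 64, 96;
# the last 8 characters of each chunk reappear at the start of the next.
# Note the tiny 4-character tail chunk: naive fixed-size splitting produces
# fragments like this, one reason semantic splitting tends to work better.
```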
## Vector Databases
| Database | Best For | Notes |
|---|---|---|
| ChromaDB | Local dev, prototyping | Embedded, no server needed |
| Pinecone | Production, managed | Serverless option available |
| pgvector | Already using Postgres | Extension, no extra infra |
| Qdrant | Self-hosted production | Rich filtering support |
## Embedding Models

Anthropic does not ship a first-party embedding model; it recommends Voyage AI. OpenAI's `text-embedding-3-small` is a common cost-efficient alternative. A sketch with the `voyageai` client (assumes the package is installed and `VOYAGE_API_KEY` is set):

```python
import voyageai

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
result = vo.embed(["How do I deploy Django?"], model="voyage-3")
vectors = result.embeddings  # one float vector per input string
```
## Improving RAG Quality
- Hybrid search — combine vector similarity with keyword (BM25) search
- Re-ranking — use a cross-encoder to re-score retrieved chunks
- Metadata filtering — filter by date, category, or source before vector search
- Query expansion — rewrite the user's query to capture more relevant results
```python
# Simple re-ranking sketch; `llm.score_relevance` is a placeholder for any
# relevance scorer (e.g. a cross-encoder), not a real library call
def rerank(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    scored = []
    for chunk in chunks:
        score = llm.score_relevance(query, chunk)  # 0-1, higher is more relevant
        scored.append((score, chunk))
    scored.sort(reverse=True)  # best scores first
    return [chunk for _, chunk in scored[:top_k]]
```
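Hybrid search from the list above is often implemented with reciprocal rank fusion (RRF), which merges a keyword ranking and a vector ranking without having to calibrate their raw scores against each other. A stdlib sketch, with illustrative document IDs (60 is the conventional RRF smoothing constant):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal rank fusion: each document earns 1/(k + rank) per ranking,
    # so documents that appear high in several rankings rise to the top
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_a", "doc_b", "doc_c"]   # BM25 ranking (illustrative)
vector_hits  = ["doc_b", "doc_d", "doc_a"]   # vector-similarity ranking
fused = rrf_fuse([keyword_hits, vector_hits])
# doc_b wins: it ranks well in both lists, beating doc_a's single #1 spot
```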
## When RAG Isn't Enough
- Structured data — use SQL or an API, not RAG
- Real-time data — RAG indexes are stale; use live API calls
- Complex reasoning — RAG retrieves facts but doesn't synthesize well across many documents
- Consider fine-tuning when RAG can't capture domain-specific reasoning patterns
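One pragmatic way to apply these rules is a small router that picks a backend per query before any retrieval happens. The keyword heuristics below are purely illustrative; production routers often use an LLM classifier instead:

```python
def route_query(query: str) -> str:
    # Illustrative heuristics only, to show the routing shape
    q = query.lower()
    if any(w in q for w in ("average", "count", "total", "per month")):
        return "sql"        # structured/aggregate questions -> database query
    if any(w in q for w in ("current", "right now", "today", "latest")):
        return "live_api"   # freshness-sensitive questions -> live source
    return "rag"            # default: retrieve from the document index
```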