Retrieval-Augmented Generation has become the go-to pattern for building AI features that need to stay grounded in real data. After shipping several RAG systems to production, I've collected the lessons that the documentation doesn't prepare you for.
## Chunking Is the Bottleneck
Everyone focuses on the LLM and the vector database. But the quality of your retrieval — and therefore your answers — is almost entirely determined by how you chunk your documents.
Fixed-size token splits with overlap are a reasonable starting point, but they fall apart with structured documents. A CV, for example, has clear sections (Experience, Skills, Education) that carry semantic meaning. Splitting mid-section destroys context.
What worked for me: Detect headings first (regex heuristics are fine for v1), then split within sections. Preserve the section label as metadata on each chunk. When you retrieve, the LLM gets chunks that are self-contained and labeled.
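A minimal sketch of this heading-first approach. The heading regex is a deliberately naive assumption (short, letter-only lines, optionally ending in a colon) and will need tuning for your documents:

```python
import re

# Assumed heuristic: a heading is a short line of letters/spaces,
# optionally ending with a colon (e.g. "Experience", "Skills:").
HEADING_RE = re.compile(r"^([A-Z][A-Za-z ]{2,40}):?\s*$", re.MULTILINE)

def chunk_by_section(text: str) -> list[dict]:
    """Split at detected headings; keep the section label as chunk metadata."""
    matches = list(HEADING_RE.finditer(text))
    if not matches:
        return [{"section": None, "text": text.strip()}]
    chunks = []
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[start:end].strip()
        if body:
            chunks.append({"section": m.group(1), "text": body})
    return chunks
```

For long sections you would still split further (by paragraph or token count) inside each section, carrying the same label onto every sub-chunk.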
## Embeddings Are Cheap — Use Them Generously
With text-embedding-3-small costing fractions of a cent per thousand tokens, there's no reason to be stingy. Embed the section label alongside the chunk text. Embed alternative phrasings of common queries as "synthetic" chunks. The retrieval improvement is significant.
## Prompt Engineering Is Iterative
Your first system prompt will be wrong. Your second one will be better but still wrong. Version your prompts in a database table, not in code. Let non-developers tweak them. Build a simple admin UI for prompt editing — it pays for itself in the first week.
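A versioned prompt store can be very small. This sketch uses SQLite for brevity (table and column names are my assumptions; in production this would live in your main database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE prompts (
        name       TEXT NOT NULL,
        version    INTEGER NOT NULL,
        content    TEXT NOT NULL,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (name, version)
    )
""")

def save_prompt(name: str, content: str) -> int:
    """Insert a new version; older versions stay untouched for rollback."""
    (next_version,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) + 1 FROM prompts WHERE name = ?",
        (name,),
    ).fetchone()
    conn.execute(
        "INSERT INTO prompts (name, version, content) VALUES (?, ?, ?)",
        (name, next_version, content),
    )
    return next_version

def latest_prompt(name: str) -> str:
    return conn.execute(
        "SELECT content FROM prompts WHERE name = ? ORDER BY version DESC LIMIT 1",
        (name,),
    ).fetchone()[0]
```

An admin UI then just needs to call `save_prompt` — every edit becomes a new row, so a bad change is one version number away from being reverted.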
The most important rule in the system prompt: tell the LLM what it should do when it doesn't know the answer. Without this, it will hallucinate confidently.
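What that rule can look like in practice — the wording below is a hypothetical example, not a canonical prompt:

```python
# Assumed wording; adapt to your product's voice.
FALLBACK_RULE = (
    "If the provided context does not contain the answer, say that you "
    "don't know and name the missing information. Never invent details."
)

SYSTEM_PROMPT = (
    "You answer questions using only the context provided below.\n"
    + FALLBACK_RULE
)
```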
## pgvector Is Enough
For most use cases — especially single-document corpora or small knowledge bases — pgvector in PostgreSQL is more than sufficient. You don't need Pinecone, Qdrant, or Weaviate. One database, one backup, one connection string.
The HNSW index in pgvector 0.5+ handles thousands of vectors with sub-millisecond query times. For a typical project with 50-200 chunks, even a sequential scan is instantaneous.
## Monitor What Matters
Log the latency of each step separately: embedding generation, vector search, LLM response. When something is slow, you'll know exactly where to look. Also log token counts — they're the primary cost driver and the first thing to optimize.
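Per-step timing is a few lines with a context manager. A minimal sketch (the `embed`/`search`/`generate` names in the usage comment are placeholders for your own functions):

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("rag")
timings: dict[str, float] = {}

@contextmanager
def timed(step: str):
    """Record and log the wall-clock duration of one pipeline step."""
    start = time.perf_counter()
    try:
        yield
    finally:
        ms = (time.perf_counter() - start) * 1000
        timings[step] = ms
        logger.info("%s took %.1f ms", step, ms)

# Usage:
# with timed("embedding"):
#     vec = embed(query)
# with timed("vector_search"):
#     chunks = search(vec)
# with timed("llm"):
#     answer = generate(chunks, query)
```

Token counts can be logged the same way, straight from the usage fields the API returns alongside each response.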
Don't store user queries in the database unless you have explicit consent and a retention policy. Data minimization isn't just a GDPR requirement — it's good engineering practice.
## The Real Complexity Is in the Edges
The happy path works in a weekend. Production-readiness takes weeks. Handle these edge cases early:
- No document indexed yet — return a helpful fallback, not an error
- Query completely unrelated to the corpus — the LLM should politely decline, not stretch for an answer
- Very long queries — truncate or summarize before embedding
- Concurrent ingestion and querying — use database transactions to avoid serving stale chunks
RAG isn't hard to build. It's hard to build well. Start simple, measure everything, and iterate based on real usage patterns.