Production RAG: Retrieval-Augmented Generation Beyond the Demo
A demo RAG works on a thousand documents. Production RAG fails on a million. Here are the engineering patterns that close the gap.

Why demo RAG breaks at scale
A naive RAG pipeline embeds every document chunk, stores them in a vector database, retrieves the top-k by cosine similarity, and stuffs them into a prompt. On a thousand documents this gives a convincing demo. At a million it produces hallucinations, irrelevant retrieval, and a cost structure that does not survive contact with a product manager.
Production RAG is a search system that happens to feed an LLM. The hard problems are search problems: ranking, recall, deduplication, freshness.
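For reference, the retrieval step of the naive pipeline described above can be sketched in a few lines. This is a toy illustration under stated assumptions: `index` is a hypothetical in-memory dictionary standing in for a vector database, and the two-dimensional vectors are made up rather than produced by an embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=3):
    """Return the k chunk IDs whose vectors are most similar to the query.

    `index` is a hypothetical dict of chunk_id -> embedding vector.
    """
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]

# Toy 2-dimensional "embeddings"; real ones come from an embedding model.
index = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
print(top_k([1.0, 0.1], index, k=2))
```

Everything that follows in this article is about what has to be added around this loop once the corpus grows.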
Chunking is the highest-leverage decision
The unit of retrieval is the chunk. Chunk too small and you lose context; chunk too large and you dilute the embedding's signal. The right answer depends on the document structure: legal contracts want semantic chunks (clauses), technical docs want section-based chunks, transcripts want speaker-turn chunks.
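As one illustration of structure-aware chunking, here is a minimal sketch of section-based chunking for technical docs. It assumes sections are introduced by lines starting with `#` (Markdown-style headings); that convention and the `chunk_by_section` helper are assumptions for the example, not something every corpus will follow.

```python
def chunk_by_section(text, heading_prefix="#"):
    """Split a document into one chunk per heading-delimited section."""
    chunks = []
    title, body = "", []

    def flush():
        # Emit the current section only if it has a title or non-blank text.
        if title or any(line.strip() for line in body):
            chunks.append({"title": title, "text": "\n".join(body).strip()})

    for line in text.splitlines():
        if line.startswith(heading_prefix):
            flush()
            title, body = line.lstrip(heading_prefix).strip(), []
        else:
            body.append(line)
    flush()
    return chunks

doc = "# Install\nRun the installer.\n\n# Configure\nEdit config.yaml.\nRestart."
for chunk in chunk_by_section(doc):
    print(chunk["title"], "->", repr(chunk["text"]))
```

Keeping the section title alongside the text is a deliberate choice in this sketch: it can be prepended to the chunk before embedding so that short sections still carry their context.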
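Late chunking runs the full document through a long-context embedding model first and then pools the resulting token-level embeddings into one vector per chunk, so each chunk vector carries context from the rest of the document. It tends to outperform naive chunking on long documents. It costs more compute at ingest and is usually worth it at retrieval.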
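The pooling step of late chunking can be sketched as follows. This is a simplified illustration, assuming mean pooling and chunk boundaries given as token index ranges; the token embeddings below are invented numbers, whereas in practice they would come from a single pass of a long-context embedding model over the whole document.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors component-wise."""
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

def late_chunk(token_embeddings, spans):
    """Pool token-level embeddings into one vector per chunk.

    token_embeddings: per-token vectors from one full-document pass.
    spans: (start, end) token index ranges, end exclusive.
    """
    return [mean_pool(token_embeddings[start:end]) for start, end in spans]

# Hypothetical token embeddings for a 4-token document, split into 2 chunks.
token_embeddings = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
spans = [(0, 2), (2, 4)]
print(late_chunk(token_embeddings, spans))  # [[2.0, 0.0], [0.0, 3.0]]
```

The difference from naive chunking is the order of operations: the model sees the whole document before any chunk boundary is applied.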
Hybrid retrieval beats pure vector search
Pure vector search misses exact-match queries (product SKUs, error codes, function names). Pure keyword search misses paraphrase. Hybrid retrieval runs a BM25 lexical search and a vector similarity search, then merges the two ranked lists with reciprocal rank fusion, which combines rank positions rather than raw scores. It outperforms either alone in every benchmark we have run.
Most modern vector databases (Weaviate, Qdrant, Pinecone) support hybrid natively. Use it.
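If you need to fuse results yourself, reciprocal rank fusion is short enough to write by hand. The sketch below assumes you already have one ranked list of document IDs per retriever; the IDs are invented, and `k=60` is the smoothing constant commonly used with this method, not a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list, best first.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Only rank positions are used, not raw scores.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results for one query from the two retrievers.
bm25_ranking = ["sku-123", "doc-7", "doc-2"]
vector_ranking = ["doc-7", "doc-9", "sku-123"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)  # ['doc-7', 'sku-123', 'doc-9', 'doc-2']
```

Because the fusion ignores raw scores, there is no need to normalize BM25 scores against cosine similarities, which sit on different scales.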
Reranking is non-negotiable at scale
Top-k retrieval gives you candidates. A reranker, typically a cross-encoder (Cohere Rerank, BGE) or a late-interaction model (a hosted ColBERT), re-scores those candidates against the query and returns the subset that is actually most relevant. The latency cost is typically 50–200 ms; the precision improvement is large.
Without reranking, you either retrieve too few documents (and miss relevant content) or stuff too many into the context (and pay for tokens that hurt rather than help).
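The shape of the retrieve-then-rerank step can be sketched like this. The `score_fn` parameter is where a real reranker would be called; `overlap_score` is only a placeholder word-overlap scorer used to make the example runnable, not a cross-encoder, and the candidate texts are invented.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Re-score (doc_id, text) candidates against the query; keep the best top_n."""
    scored = [(score_fn(query, text), doc_id) for doc_id, text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:top_n]]

def overlap_score(query, text):
    """Placeholder scorer: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0

# Hypothetical first-stage candidates, in retrieval order.
candidates = [
    ("c1", "how to rotate api keys safely"),
    ("c2", "billing faq and invoices"),
    ("c3", "rotate keys with the admin cli"),
]
print(rerank("rotate api keys", candidates, overlap_score, top_n=2))  # ['c1', 'c3']
```

The usual arrangement is to retrieve generously in the first stage and let the reranker cut the list down to what goes into the prompt.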
Grounding and citation
Every claim in the LLM's output should be grounded in a retrieved chunk and, ideally, cited back to it. Without grounding, the model fabricates plausible-sounding answers. With grounding, the user can audit each claim.
The prompt pattern is simple: 'Answer only using the provided sources. If the sources do not contain the answer, say so. Cite the source ID after each claim.' Combined with a model that respects the instruction (current Claude, GPT, Gemini all do), this dramatically reduces hallucination.
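A minimal sketch of that prompt pattern, plus a cheap post-check on the answer, might look like the following. The bracketed `[S1]` citation format, the `build_prompt` and `unknown_citations` helpers, and the sample sources are all assumptions made for illustration; no model is called here.

```python
import re

INSTRUCTIONS = (
    "Answer only using the provided sources. "
    "If the sources do not contain the answer, say so. "
    "Cite the source ID after each claim."
)

def build_prompt(question, sources):
    """Assemble a grounded prompt from a dict of source_id -> chunk text."""
    lines = [INSTRUCTIONS, "", "Sources:"]
    for source_id, text in sources.items():
        lines.append(f"[{source_id}] {text}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)

def unknown_citations(answer, sources):
    """Return cited IDs (assumed to look like [S1]) that were never provided."""
    cited = set(re.findall(r"\[([^\[\]]+)\]", answer))
    return sorted(cited - set(sources))

# Hypothetical retrieved chunks and a hypothetical model answer.
sources = {"S1": "API keys rotate every 90 days.", "S2": "Backups run nightly."}
prompt = build_prompt("How often do keys rotate?", sources)
answer = "Keys rotate every 90 days [S1]. Logs are kept for a year [S9]."
print(unknown_citations(answer, sources))  # ['S9']
```

A check like `unknown_citations` only catches citations to sources that do not exist; it does not verify that a cited chunk actually supports the claim.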
Evaluation is the missing discipline
Every production RAG system needs an evaluation harness — a set of representative queries with expected answers, run on every change to chunking, retrieval, or prompts. Without this, you ship regressions silently.
RAGAS, TruLens, and Phoenix are the current open-source frameworks. For domain-specific evaluation, hand-curated golden sets are still the gold standard.
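Even before adopting a framework, a golden set can be checked with a few lines of code. The sketch below measures retrieval hit rate at k: the share of queries for which at least one expected document appears in the top k results. The golden set, the `fake_results` lookup, and the `retrieve` callable are hypothetical stand-ins for your own data and retriever.

```python
def hit_rate_at_k(golden_set, retrieve, k=5):
    """Fraction of queries with at least one relevant ID in the top k results.

    golden_set: list of {"query": str, "relevant_ids": list of str}.
    retrieve: callable taking a query and returning ranked document IDs.
    """
    if not golden_set:
        return 0.0
    hits = 0
    for case in golden_set:
        retrieved = set(retrieve(case["query"])[:k])
        if retrieved & set(case["relevant_ids"]):
            hits += 1
    return hits / len(golden_set)

# Hypothetical golden set and a stubbed retriever for demonstration.
golden_set = [
    {"query": "reset password", "relevant_ids": ["doc-1"]},
    {"query": "export data", "relevant_ids": ["doc-4"]},
]
fake_results = {
    "reset password": ["doc-1", "doc-3"],
    "export data": ["doc-2", "doc-5"],
}
print(hit_rate_at_k(golden_set, lambda query: fake_results[query], k=2))  # 0.5
```

Run on every change to chunking, retrieval, or prompts, a number like this makes a regression visible before it ships.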
Reader questions, answered
Which vector database should we use?
Postgres + pgvector is fine up to tens of millions of vectors. Above that, Qdrant, Weaviate, and Pinecone are credible. Pick by operational fit, not benchmarks.
Do we need fine-tuning?
Usually not for RAG. Fine-tune only when the model lacks domain vocabulary or response style that few-shot prompting cannot fix.

Raza Ahmad is a technology author and IT infrastructure specialist based in Melbourne, Australia. He writes practitioner-grade guides on cloud computing (Azure and AWS), cybersecurity, enterprise networking with Cisco platforms, Linux administration, DevOps, and virtualization. His work focuses on translating complex infrastructure topics into clear, accurate guidance that engineers, system administrators, and IT decision makers can put to work in production environments. Every article published under his byline is fact-checked against current vendor documentation, official standards, and Raza's own hands-on experience operating the technologies he covers.