LLM Engineering (8): Retrieval-Augmented Generation

Fri, 03 Apr 2026 09:00:00 +0000

RAG is the most over-deployed and under-engineered pattern in LLM applications. The 2024 demo loop (embed everything with text-embedding-3-large, dump it into pgvector, take the top 5 by cosine similarity) works for 1,000 documents and a forgiving audience. It does not survive 100K real documents and a customer who notices when the answer is wrong. This chapter covers what I wish more teams knew before they built their second generation of RAG.
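To make the demo loop concrete, here is a minimal sketch of its retrieval step: score every stored vector against the query by cosine similarity and keep the best k. Everything in it is an assumption for illustration. The helper names (`cosine`, `top_k`), the document ids, and the 3-dimensional toy vectors are made up; a real system would get its vectors from an embedding model and let the vector store do the ranking.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=5):
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings"; real ones have hundreds or thousands of dimensions.
docs = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.0],
    "api-rate-limits": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.0]

results = top_k(query, docs, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

Note what the sketch leaves out: there is no reranking, no filtering, and no check that the top hit is actually relevant. It always returns k results, however weak the match.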

The original RAG paper (Lewis et al., 2020) framed retrieval-augmented generation as a hybrid model: a dense retriever (DPR) trained jointly with a generator (BART) so the retrieval objective optimized end-task accuracy. Production RAG in 2026 doesn’t look much like the RAG of Lewis et al.: modern systems use frozen pre-trained embedders, separate rerankers, and decoder-only generators that don’t train against the retriever. But the core insight (parameterize knowledge separately from reasoning) survived and became the dominant paradigm. The Gao et al. (2023) RAG survey is the best comprehensive overview of the post-2020 evolution into “Naive RAG → Advanced RAG → Modular RAG.”

