Embeddings on Chen Kai Blog

LLM Engineering (8): Retrieval-Augmented Generation

Fri, 03 Apr 2026 09:00:00 +0000

RAG is the most over-deployed and under-engineered pattern in LLM applications. The 2024 demo loop — embed everything with text-embedding-3-large, dump into pgvector, top-5 cosine — works for 1000 documents and a forgiving demo. It does not survive 100K real documents and a customer who notices when the answer is wrong. This chapter is what I wish more teams knew before they built their second generation of RAG.
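To make the demo loop concrete, here is a minimal sketch of its retrieval step: brute-force cosine similarity followed by a top-k cut. It makes several assumptions. The three-dimensional vectors are toy stand-ins for real embeddings, the document ids are invented, and a plain dictionary takes the place of a vector store such as pgvector. The helper names `cosine` and `top_k` are hypothetical and not taken from any library.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=5):
    """Return (doc_id, score) pairs for the k most similar documents."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" (assumption): real vectors would come from an embedding model.
DOCS = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.0],
    "api-rate-limits": [0.0, 0.1, 0.9],
}

results = top_k([0.8, 0.2, 0.0], DOCS, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

The whole loop is a linear scan plus a sort, which is why it feels fine at demo scale. Nothing in it checks whether the top-ranked chunk actually answers the question.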

The original RAG paper (Lewis et al., 2020) framed retrieval-augmented generation as a hybrid model: a dense retriever (DPR) trained jointly with a generator (BART) so the retrieval objective optimized end-task accuracy. Production RAG in 2026 doesn’t look much like Lewis’s RAG: modern systems use frozen pre-trained embedders, separate rerankers, and decoder-only generators that don’t train against the retriever. But the core insight (parameterize knowledge separately from reasoning) survived and became the dominant paradigm. The Gao et al. (2023) RAG survey is the most comprehensive overview of the post-2020 evolution into “Naive RAG → Advanced RAG → Modular RAG.”

Recommendation Systems (3): Deep Learning Foundations

Sun, 07 Dec 2025 09:00:00 +0000

In June 2016, Google published a short four-page paper that quietly redrew the map of recommendation systems. The paper described Wide & Deep Learning, the model then powering app recommendations inside Google Play, a billion-user product. Within a year, every major tech company had a deep model in production. By 2019, the industry standard had shifted: matrix factorization was a baseline, not a system.

What changed? Multi-layer neural networks brought four capabilities classical methods could not deliver:

NLP (10): RAG and Knowledge Enhancement Systems

Sat, 15 Nov 2025 09:00:00 +0000

A frozen language model is a confident liar. It can’t read yesterday’s incident report, your company wiki, or the patch notes that shipped this morning, so when you ask, it confabulates an answer that is grammatically perfect but factually wrong. Retrieval-Augmented Generation (RAG) breaks the deadlock by separating memory from reasoning: keep the LLM small and stable, and put the volatile knowledge in an external store that you can update anytime. Before generating, retrieve the relevant evidence and condition the model on it.
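The separation of memory from reasoning can be sketched in a few lines. The example below is an illustration under stated assumptions, not the post's implementation. `EvidenceStore`, `tokenize`, and `build_prompt` are hypothetical names, and the retriever scores by plain word overlap as a stand-in for a real embedding-based retriever. No model is called; the sketch stops at the prompt that a frozen LLM would be conditioned on.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class EvidenceStore:
    """External, updatable memory; the model itself is never retrained."""

    def __init__(self):
        self.docs = {}

    def upsert(self, doc_id, text):
        # Overwriting an id is how volatile knowledge gets refreshed.
        self.docs[doc_id] = text

    def retrieve(self, question, k=2):
        # Word-overlap scoring (assumption): a stand-in for dense retrieval.
        q = tokenize(question)
        scored = [(len(q & tokenize(text)), doc_id) for doc_id, text in self.docs.items()]
        scored = [(score, doc_id) for score, doc_id in scored if score > 0]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [self.docs[doc_id] for _, doc_id in scored[:k]]


def build_prompt(question, evidence):
    """Condition the generator on retrieved evidence by placing it in the prompt."""
    context = "\n".join(f"- {item}" for item in evidence) or "- (no evidence found)"
    return (
        "Answer using only the evidence below.\n\n"
        f"Evidence:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


store = EvidenceStore()
question = "What is the login timeout?"

store.upsert("patch-notes", "Version 2.3 patch notes: login timeout raised to 30 minutes.")
prompt_before = build_prompt(question, store.retrieve(question))

# This morning's patch ships: only the store changes, not the model.
store.upsert("patch-notes", "Version 2.4 patch notes: login timeout raised to 60 minutes.")
prompt_after = build_prompt(question, store.retrieve(question))

print(prompt_after)
```

The same question yields a different prompt after the update, even though nothing about the model has changed. That is the property the excerpt describes.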