RAG on Chen Kai Blog

Alibaba Cloud Full Stack (9): OpenSearch and AI Search

Wed, 06 May 2026 09:00:00 +0000

I built my first search engine with Elasticsearch and a pile of synonyms. It took six months to get decent results. Every week, users complained about missing results, so I added more synonyms, broke something else, and added exception rules. The relevance tuning spreadsheet grew to 400 rows. I had custom analyzers for three languages, a boosting config that no one understood (including me), and a reindexing job that took four hours. Then I tried hybrid vector+keyword search on a side project and got better results on day one. Not marginally better — “users stopped complaining” better. That experience completely changed how I think about search, and it’s the reason this article exists.

LLM Engineering (8): Retrieval-Augmented Generation

Fri, 03 Apr 2026 09:00:00 +0000

RAG is the most over-deployed and under-engineered pattern in LLM applications. The 2024 demo loop — embed everything with text-embedding-3-large, dump into pgvector, top-5 cosine — works for 1000 documents and a forgiving demo. It does not survive 100K real documents and a customer who notices when the answer is wrong. This chapter is what I wish more teams knew before they built their second generation of RAG.

The original RAG paper (Lewis et al., 2020 ) framed retrieval-augmented generation as a hybrid model: a dense retriever (DPR) trained jointly with a generator (BART) so the retrieval objective optimized end-task accuracy. Production RAG in 2026 doesn’t look much like Lewis’s RAG — modern systems use frozen pre-trained embedders, separate rerankers, and decoder-only generators that don’t train against the retriever. But the core insight (parameterize knowledge separately from reasoning) survived and became the dominant paradigm. The Gao et al. (2023) RAG survey is the best comprehensive overview of the post-2020 evolution into “Naive RAG → Advanced RAG → Modular RAG.”

NLP (10): RAG and Knowledge Enhancement Systems

Sat, 15 Nov 2025 09:00:00 +0000

A frozen language model is a confident liar. It can’t read yesterday’s incident report, your company wiki, or the patch notes that shipped this morning, so when you ask, it confabulates an answer that is grammatically perfect but factually wrong. Retrieval-Augmented Generation (RAG) breaks the deadlock by separating memory from reasoning: keep the LLM small and stable, and put the volatile knowledge in an external store that you can update anytime. Before generating, retrieve the relevant evidence and condition the model on it.

LLM Workflows and Application Architecture: Enterprise Implementation Guide

Thu, 31 Jul 2025 09:00:00 +0000

Most LLM tutorials end where the interesting work begins. They show you how to call a chat completion endpoint, attach a vector store, and wrap the whole thing in a Streamlit demo. None of that is wrong, but none of it is what breaks at 3 a.m. when 10,000 users hit your service at once and every other answer is a hallucination.

This article is about everything that comes after the demo. It is opinionated on purpose: production LLM systems are mostly plain distributed systems with one non-deterministic component bolted on, and most of the engineering effort goes into containing that non-determinism. We will work through seven dimensions — application architecture, workflow patterns, the RAG-vs-fine-tune decision, deployment topology, cost, observability, and enterprise integration — keeping each one short, concrete, and grounded in the levers that actually move the needle.