<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>NLP on Chen Kai Blog</title><link>https://www.chenk.top/en/nlp/</link><description>Recent content in NLP on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Tue, 25 Nov 2025 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/nlp/index.xml" rel="self" type="application/rss+xml"/><item><title>NLP (12): Frontiers and Practical Applications</title><link>https://www.chenk.top/en/nlp/frontiers-applications/</link><pubDate>Tue, 25 Nov 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/frontiers-applications/</guid><description>&lt;p>We have spent eleven chapters climbing from raw text to multimodal foundation models. This twelfth and final chapter sits at the frontier and at the runway. It is where research stops being a paper and starts being a service: an LLM that calls tools, writes and debugs code, reasons through hundred-step problems, ingests a 200K-token contract, and serves a thousand concurrent users behind a FastAPI endpoint with p95 latency under 300 ms.&lt;/p></description></item><item><title>NLP (11): Multimodal Large Language Models</title><link>https://www.chenk.top/en/nlp/multimodal-nlp/</link><pubDate>Thu, 20 Nov 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/multimodal-nlp/</guid><description>&lt;p>Humans never perceive the world in one channel at a time. We watch a chart while reading the caption, hear a tone of voice while reading a face, glance at a screenshot while debating a bug. Pure-text language models are deaf and blind to all of that. 
&lt;strong>Multimodal Large Language Models (MLLMs)&lt;/strong> close the gap by aligning images, audio, and video into the same representation space the language model already speaks.&lt;/p></description></item><item><title>NLP (10): RAG and Knowledge Enhancement Systems</title><link>https://www.chenk.top/en/nlp/rag-knowledge-enhancement/</link><pubDate>Sat, 15 Nov 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/rag-knowledge-enhancement/</guid><description>&lt;p>A frozen language model is a confident liar. It cannot read yesterday&amp;rsquo;s incident report, your company wiki, or the patch notes that shipped this morning, so when you ask, it confabulates an answer that is grammatically perfect and factually wrong. &lt;strong>Retrieval-Augmented Generation (RAG)&lt;/strong> breaks the deadlock by separating &lt;em>memory&lt;/em> from &lt;em>reasoning&lt;/em>: keep the LLM small and stable, and put the volatile knowledge in an external store that you can update at any time. Before generating, retrieve the relevant evidence and condition the model on it.&lt;/p></description></item><item><title>NLP (9): Deep Dive into LLM Architecture</title><link>https://www.chenk.top/en/nlp/llm-architecture-deep-dive/</link><pubDate>Mon, 10 Nov 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/llm-architecture-deep-dive/</guid><description>&lt;p>The 2017 Transformer paper drew one block. Every production LLM today still uses that diagram as a silhouette, but almost every internal piece has been replaced. Pre-norm replaced post-norm. RMSNorm replaced LayerNorm. SwiGLU replaced GELU. Rotary embeddings replaced sinusoids. Multi-head attention became grouped-query attention. The dense FFN sometimes became a sparse mixture of experts. 
And the inference loop is dominated by a data structure that doesn&amp;rsquo;t appear in the original paper at all: the KV cache.&lt;/p></description></item><item><title>NLP (8): Model Fine-tuning and PEFT</title><link>https://www.chenk.top/en/nlp/fine-tuning-peft/</link><pubDate>Wed, 05 Nov 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/fine-tuning-peft/</guid><description>&lt;p>In 2020, fine-tuning a 7-billion-parameter language model was a project budget item: eight A100s, several days, and an engineer who knew how to babysit gradient checkpointing. In 2024, a graduate student does it on a laptop. The distance between those two worlds is almost entirely covered by one paper — Hu et al.&amp;rsquo;s LoRA (ICLR 2022) — and one follow-up — Dettmers et al.&amp;rsquo;s QLoRA (NeurIPS 2023).&lt;/p>
&lt;p>The shift is not just engineering. Parameter-Efficient Fine-Tuning (PEFT) reframes what it means to &amp;ldquo;have a model.&amp;rdquo; Instead of one binary blob per task, you keep a single frozen base model and a directory of small adapter files, each a few tens of megabytes. Switching tasks becomes loading a new adapter; serving N domains becomes O(1) base + N · ε.&lt;/p></description></item><item><title>NLP (7): Prompt Engineering and In-Context Learning</title><link>https://www.chenk.top/en/nlp/prompt-engineering-icl/</link><pubDate>Fri, 31 Oct 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/prompt-engineering-icl/</guid><description>&lt;p>The same model can produce a sharp answer or a confident hallucination. The difference is rarely the weights &amp;ndash; it is the framing. A vague request like &lt;em>&amp;ldquo;analyze this text&amp;rdquo;&lt;/em> gets you a generic summary; a prompt with a role, two clean examples, and a strict output schema gets you something a parser can consume. &lt;strong>Prompt engineering is the discipline of turning that gap into a repeatable system instead of a lucky shot.&lt;/strong>&lt;/p></description></item><item><title>NLP Part 6: GPT and Generative Language Models</title><link>https://www.chenk.top/en/nlp/gpt-generative-models/</link><pubDate>Sun, 26 Oct 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/gpt-generative-models/</guid><description>&lt;p>When you ask ChatGPT a question and a fluent multi-paragraph answer streams back token by token, you are watching a single deceptively simple loop: feed everything-so-far into a Transformer decoder, look at the probability distribution it produces over the vocabulary, pick one token, append it, repeat. That is &lt;em>all&lt;/em> an autoregressive language model does. 
The miracle is not the loop &amp;ndash; it is what happens when you scale the network behind the loop to hundreds of billions of parameters and train it on most of the internet.&lt;/p></description></item><item><title>NLP Part 5: BERT and Pretrained Models</title><link>https://www.chenk.top/en/nlp/bert-pretrained-models/</link><pubDate>Tue, 21 Oct 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/bert-pretrained-models/</guid><description>&lt;p>In October 2018, Google released BERT and set new state-of-the-art results on eleven NLP tasks at once. The recipe is almost embarrassingly simple: take a Transformer encoder, train it to predict words that have been randomly hidden using both left and right context, and then fine-tune the same pretrained model for whatever downstream task you have. Before BERT, every task came with its own from-scratch model. After BERT, &amp;ldquo;pretrain once, fine-tune everywhere&amp;rdquo; became the default mental model for the entire field.&lt;/p></description></item><item><title>NLP Part 4: Attention Mechanism and Transformer</title><link>https://www.chenk.top/en/nlp/attention-transformer/</link><pubDate>Thu, 16 Oct 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/attention-transformer/</guid><description>&lt;p>In June 2017, eight researchers at Google Brain and Google Research published a paper with a deliberately bold title: &lt;em>Attention Is All You Need&lt;/em>. The architecture it introduced, the &lt;strong>Transformer&lt;/strong>, threw away recurrence entirely. There were no LSTMs, no GRUs, no left-to-right scanning of a sentence. Instead, every token in a sequence could look at every other token directly through a single mathematical operation: scaled dot-product attention.&lt;/p>
&lt;p>That one design decision unlocked massive parallelism on GPUs, eliminated the long-range dependency problems that had plagued RNNs for decades, and became the substrate on which BERT, GPT, T5, LLaMA, Claude, and essentially every modern large language model is built. If you understand this article well, the rest of the series is mostly variations on a theme.&lt;/p></description></item><item><title>NLP Part 3: RNN and Sequence Modeling</title><link>https://www.chenk.top/en/nlp/rnn-sequence-modeling/</link><pubDate>Sat, 11 Oct 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/rnn-sequence-modeling/</guid><description>&lt;p>Open Google Translate, swipe-type a message, dictate a memo to your phone — every one of these systems must consume an ordered stream of tokens and produce another. A feed-forward network treats each input independently, but language is fundamentally &lt;strong>sequential&lt;/strong>: the meaning of &amp;ldquo;mat&amp;rdquo; in &lt;em>the cat sat on the mat&lt;/em> depends on every word that came before. Recurrent Neural Networks (RNNs) handle this by maintaining a &lt;strong>hidden state&lt;/strong> that evolves as they consume each token. The hidden state is the network&amp;rsquo;s running summary of the past — its memory.&lt;/p></description></item><item><title>NLP Part 2: Word Embeddings and Language Models</title><link>https://www.chenk.top/en/nlp/word-embeddings-lm/</link><pubDate>Mon, 06 Oct 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/word-embeddings-lm/</guid><description>&lt;p>For decades, machines treated &amp;ldquo;king&amp;rdquo; and &amp;ldquo;queen&amp;rdquo; as unrelated symbols &amp;ndash; nothing more than two distinct slots in a vocabulary list. Then a single idea changed everything: what if every word lived in a continuous space, and meaning was just a &lt;em>direction&lt;/em>? Once that idea took hold, models could compute&lt;/p>
$$\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}$$&lt;p>and the entire trajectory of NLP turned toward representation learning. This article walks through that turn &amp;ndash; from the failure of one-hot vectors, to Word2Vec&amp;rsquo;s shallow networks, to the global statistics that GloVe exploits, to the subword n-grams that let FastText handle words it has never seen &amp;ndash; and finally connects embeddings to the language models that gave rise to them.&lt;/p></description></item><item><title>NLP Part 1: Introduction and Text Preprocessing</title><link>https://www.chenk.top/en/nlp/introduction-and-preprocessing/</link><pubDate>Wed, 01 Oct 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/introduction-and-preprocessing/</guid><description>&lt;p>Every time you ask Claude a question, autocomplete a sentence in Gmail, or read a Google Translate page, you are touching a stack that took seventy years to assemble. Natural Language Processing is the discipline that taught machines to read, score, transform, and write human language &amp;ndash; and the surprising thing is how much of the modern stack still rests on a small set of preprocessing primitives invented decades ago.&lt;/p></description></item></channel></rss>