NLP on Chen Kai Blog

NLP (12): Frontiers and Practical Applications

Tue, 25 Nov 2025 09:00:00 +0000

We have spent eleven chapters climbing from raw text to multimodal foundation models. This twelfth and final chapter sits at the frontier and at the runway. It is where research stops being a paper and starts being a service: an LLM that calls tools, writes and debugs code, reasons through hundred-step problems, ingests a 200K-token contract, and serves a thousand concurrent users behind a FastAPI endpoint with p95 latency under 300 ms.

NLP (11): Multimodal Large Language Models

Thu, 20 Nov 2025 09:00:00 +0000

Humans never perceive the world in one channel at a time. We watch a chart while reading the caption, hear a tone of voice while reading a face, glance at a screenshot while debating a bug. Pure-text language models are deaf and blind to all of that. Multimodal Large Language Models (MLLMs) close the gap by aligning images, audio, and video into the same representation space the language model already speaks.

NLP (10): RAG and Knowledge Enhancement Systems

Sat, 15 Nov 2025 09:00:00 +0000

A frozen language model is a confident liar. It can’t read yesterday’s incident report, your company wiki, or the patch notes that shipped this morning, so when you ask, it confabulates an answer that is grammatically perfect but factually wrong. Retrieval-Augmented Generation (RAG) breaks the deadlock by separating memory from reasoning: keep the LLM small and stable, and put the volatile knowledge in an external store that you can update anytime. Before generating, retrieve the relevant evidence and condition the model on it.

NLP (9): Deep Dive into LLM Architecture

Mon, 10 Nov 2025 09:00:00 +0000

The 2017 Transformer paper drew one block. Every production LLM today still uses that diagram as a silhouette, but almost every internal piece has been replaced. Pre-norm replaced post-norm. RMSNorm replaced LayerNorm. SwiGLU replaced GELU. Rotary embeddings replaced sinusoids. Multi-head attention became grouped-query attention. The dense FFN sometimes became a sparse mixture of experts. And the inference loop is dominated by a data structure that doesn’t appear in the original paper at all: the KV cache.

NLP (8): Model Fine-tuning and PEFT

Wed, 05 Nov 2025 09:00:00 +0000

In 2020, fine-tuning a 7-billion-parameter language model was a project budget item: eight A100s, several days, and an engineer who knew how to babysit gradient checkpointing. In 2024, a graduate student does it on a laptop. The distance between those two worlds is almost entirely covered by one paper — Hu et al.’s LoRA (ICLR 2022) — and one follow-up — Dettmers et al.’s QLoRA (NeurIPS 2023).

The shift is not just engineering. Parameter-Efficient Fine-Tuning (PEFT) reframes what it means to “have a model.” Instead of one binary blob per task, you keep a single frozen base model and a directory of small adapter files, each a few tens of megabytes. Switching tasks becomes loading a new adapter; serving N domains becomes O(1) base + N · ε.

NLP (7): Prompt Engineering and In-Context Learning

Fri, 31 Oct 2025 09:00:00 +0000

The same model can produce a sharp answer or a confident hallucination. The difference lies in the framing, not the weights. A vague request like “analyze this text” yields a generic summary; a prompt with a role, two clear examples, and a strict output schema produces something a parser can use. Prompt engineering turns that gap into a repeatable system, not just a lucky shot.

In-Context Learning (ICL) is the mechanism that makes this work. When you include a few examples in the prompt, the model doesn’t retrain; it conditions its forward pass on those examples and effectively infers a task from them. Understanding ICL’s capabilities and limitations separates a developer who struggles with the model from one who guides it.

NLP (6): GPT and Generative Language Models

Sun, 26 Oct 2025 09:00:00 +0000

When you ask ChatGPT a question and a fluent multi-paragraph answer streams back token by token, you are watching a single deceptively simple loop: feed everything-so-far into a Transformer decoder, look at the probability distribution it produces over the vocabulary, pick one token, append it, repeat. That is all an autoregressive language model does. The miracle is not the loop — it is what happens when you scale the network behind the loop to hundreds of billions of parameters and train it on most of the internet.

NLP (5): BERT and Pretrained Models

Tue, 21 Oct 2025 09:00:00 +0000

In October 2018, Google released BERT and broke eleven NLP benchmarks at once. The recipe is almost embarrassingly simple: take a Transformer encoder, train it to predict words that have been randomly hidden using both left and right context, and then fine-tune the same pretrained model for whatever downstream task you have. Before BERT, every task came with its own from-scratch model. After BERT, “pretrain once, fine-tune everywhere” became the default mental model for the entire field.

NLP (4): Attention Mechanism and Transformer

Thu, 16 Oct 2025 09:00:00 +0000

In June 2017, eight researchers at Google Brain and Google Research published a paper with a deliberately bold title: Attention Is All You Need. The architecture it introduced, the Transformer, threw away recurrence entirely. There were no LSTMs, no GRUs, no left-to-right scanning of a sentence. Instead, every token in a sequence could look at every other token directly through a single mathematical operation: scaled dot-product attention.

That one design decision unlocked massive parallelism on GPUs, eliminated the long-range dependency problems that had plagued RNNs for decades, and became the substrate on which BERT, GPT, T5, LLaMA, Claude, and essentially every modern large language model is built. If you understand this article well, the rest of the series is mostly variations on a theme.

NLP (3): RNN and Sequence Modeling

Sat, 11 Oct 2025 09:00:00 +0000

Open Google Translate, swipe-type a message, or dictate a memo to your phone — all these systems consume an ordered stream of tokens and produce another. A feed-forward network processes each input independently, but language is fundamentally sequential: the meaning of “mat” in the cat sat on the mat depends on every word that came before. Recurrent Neural Networks (RNNs) handle this by maintaining a hidden state that evolves as they process each token. The hidden state is the network’s running summary of the past — its memory.

NLP (2): Word Embeddings and Language Models

Mon, 06 Oct 2025 09:00:00 +0000

\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}

The entire trajectory of NLP shifted toward representation learning. This article walks through that shift—from the failure of one-hot vectors, to Word2Vec’s shallow networks, to the global statistics that GloVe exploits, to the subword n-grams that let FastText handle unseen words—and finally connects embeddings to the language models that gave rise to them.

NLP (1): Introduction and Text Preprocessing

Wed, 01 Oct 2025 09:00:00 +0000

Every time you ask Claude a question, autocomplete a sentence in Gmail, or read a Google Translate page, you’re using a stack that took seventy years to build. Natural Language Processing (NLP) is the field that taught machines to read, score, transform, and write human language. Surprisingly, much of the modern NLP stack still relies on preprocessing techniques from decades ago.

This first article in the series does two things. First, it maps out the field’s history, current scope, and the reasons behind the tools we use. Second, it builds the foundational layer — cleaning, tokenization, normalization, and feature extraction — with code you can use directly in a project. By the end, you’ll have a reusable preprocessing pipeline and, more importantly, an understanding of when each step is helpful and when it can destroy signal.

Position Encoding Brief: From Sinusoidal to RoPE and ALiBi

Fri, 30 Jun 2023 09:00:00 +0000

Self-attention has a strange property that surprises most people the first time they compute it by hand: it does not know the order of its inputs. Permute the tokens and every attention score is permuted along with them — the function is exactly equivariant. So before we can do anything useful with a Transformer, we have to inject position information from the outside.

That single design decision — how to inject it — has spawned a remarkable amount of research. Sinusoidal, learned, relative, T5-style buckets, RoPE, ALiBi, NoPE, and more. This post is a practitioner’s brief: enough math to know why each scheme works, enough comparison to choose one, and a clear focus on the property that matters most in the LLM era — length extrapolation, the ability to handle sequences longer than anything seen in training.

Multimodal LLMs and Downstream Tasks: A Practitioner's Guide

Sat, 09 Apr 2022 09:00:00 +0000

Stuffing pixels, audio, and video into a language model so it can “see,” “hear,” and reason — that was a research curiosity before CLIP landed in 2021. Today it’s table stakes for most consumer-facing AI products. But shipping a Multimodal LLM (MLLM) in production turns out to be hard in places people rarely talk about. Almost never the vision encoder. Almost always these four:

Alignment. How does the language model “understand” what the vision encoder produces? Is the projector a 2-layer MLP or a Q-Former? Which parameters thaw during training?
Task framing. The same MLLM has to do captioning, VQA, grounding, OCR. Each needs a prompt template that doesn’t quietly drop several points of accuracy.
Cost. A 1024x1024 image becomes hundreds of visual tokens. Prefill is brutal. Stretch that to video and the bill goes vertical. Token compression, KV cache reuse, and batching are not optional.
Evaluation. A model that scores 80 on MMBench can still hallucinate confidently on your customer’s invoice. Public benchmarks are the easy part.

This post follows the natural research arc — architecture, model families, downstream tasks, fine-tuning, evaluation, deployment — and tries to be specific enough at each stop that you can act on it. Less “what’s possible,” more “what to actually pick.”