Deep Learning on Chen Kai Blog

ML Math Derivations (19): Neural Networks and Backpropagation

Sat, 07 Feb 2026 09:00:00 +0000

Hook. In 1969 Minsky and Papert proved that a single perceptron could not learn XOR, and connectionist research went into a fifteen-year freeze. The thaw came when Rumelhart, Hinton and Williams realised that stacking perceptrons makes the problem disappear — and that the same chain rule everyone learns in calculus, applied carefully, computes every gradient in a multilayer network for the cost of a single extra forward pass. That algorithm is backpropagation. Every gradient in every Transformer, every diffusion model, every GPT trained today still runs on it.
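To make the chain-rule claim concrete, here is a minimal sketch of backpropagation on a tiny network, checked against finite differences. The network shape (two tanh hidden units, one linear output, squared-error loss) and all numeric values are assumptions chosen for illustration, not taken from the article.

```python
import math

def forward(params, x, y):
    """Two tanh hidden units, one linear output, squared-error loss."""
    W1, b1, w2, b2 = params
    z = [sum(W1[j][i] * x[i] for i in range(len(x))) + b1[j] for j in range(len(b1))]
    h = [math.tanh(v) for v in z]
    out = sum(w2[j] * h[j] for j in range(len(h))) + b2
    loss = 0.5 * (out - y) ** 2
    return loss, (h, out)

def backward(params, x, y):
    """Chain rule applied layer by layer, from the loss back to the first layer."""
    W1, b1, w2, b2 = params
    loss, (h, out) = forward(params, x, y)
    d_out = out - y                                    # dL/d(out)
    g_w2 = [d_out * hj for hj in h]                    # dL/dw2
    g_b2 = d_out                                       # dL/db2
    # tanh'(z) = 1 - tanh(z)^2, so dL/dz_j = dL/d(out) * w2_j * (1 - h_j^2)
    d_z = [d_out * w2[j] * (1.0 - h[j] ** 2) for j in range(len(h))]
    g_W1 = [[d_z[j] * xi for xi in x] for j in range(len(h))]
    g_b1 = d_z
    return loss, (g_W1, g_b1, g_w2, g_b2)

def numeric_grad_W1(params, x, y, j, i, eps=1e-6):
    """Central finite difference for a single entry of W1."""
    W1, b1, w2, b2 = params
    def loss_with(delta):
        W1p = [row[:] for row in W1]
        W1p[j][i] += delta
        return forward((W1p, b1, w2, b2), x, y)[0]
    return (loss_with(eps) - loss_with(-eps)) / (2 * eps)

# Illustrative parameters and a single training example.
params = ([[0.5, -0.3], [0.8, 0.2]], [0.1, -0.1], [1.0, -1.5], 0.05)
x, y = [1.0, 0.0], 1.0

loss, grads = backward(params, x, y)
analytic = grads[0][0][0]
numeric = numeric_grad_W1(params, x, y, 0, 0)
print(abs(analytic - numeric) < 1e-6)
```

The backward function reuses the activations stored by the forward pass, which is why all gradients arrive for roughly the cost of one more sweep through the network.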

Recommendation Systems (4): CTR Prediction and Click-Through Rate Modeling

Wed, 10 Dec 2025 09:00:00 +0000

Every time you scroll through a social-media feed, click a product recommendation, or watch a suggested video, a CTR (click-through rate) model decides what to show you. These models answer one deceptively small question:

“What is the probability that this specific user will click on this specific item, right now?”

Behind that question lies one of the most economically valuable problems in machine learning. A 1% lift in CTR translates into millions of dollars at the scale of Google, Amazon, or Alibaba — and the same models also drive video feeds, app stores, news apps, and dating apps. CTR prediction sits at the heart of the ranking stage: candidate generation gives you a few thousand items, and the CTR model decides which dozen actually reach the user.

Recommendation Systems (3): Deep Learning Foundations

Sun, 07 Dec 2025 09:00:00 +0000

In June 2016, Google published a short paper that quietly redrew the map of recommendation systems. The paper described Wide & Deep Learning, the model then powering app recommendations inside Google Play, a billion-user product. Within a year, every major tech company had a deep model in production. By 2019, the industry standard had shifted: matrix factorization was a baseline, not a system.

What changed? Multi-layer neural networks brought four capabilities classical methods could not deliver:

NLP (6): GPT and Generative Language Models

Sun, 26 Oct 2025 09:00:00 +0000

When you ask ChatGPT a question and a fluent multi-paragraph answer streams back token by token, you are watching a single deceptively simple loop: feed everything-so-far into a Transformer decoder, look at the probability distribution it produces over the vocabulary, pick one token, append it, repeat. That is all an autoregressive language model does. The miracle is not the loop — it is what happens when you scale the network behind the loop to hundreds of billions of parameters and train it on most of the internet.
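The loop described above can be sketched in a few lines. The vocabulary and the `toy_next_token_probs` function are hypothetical stand-ins for a real Transformer decoder; only the control flow is the point.

```python
import random

VOCAB = ["<eos>", "the", "cat", "sat", "mat"]

def toy_next_token_probs(tokens):
    """Hypothetical stand-in for a decoder: a distribution over VOCAB.

    Made-up rule: strongly prefer the vocabulary entry after the last token.
    """
    last = VOCAB.index(tokens[-1]) if tokens else 0
    weights = [1.0] * len(VOCAB)
    weights[(last + 1) % len(VOCAB)] = 10.0
    total = sum(weights)
    return [w / total for w in weights]

def generate(prompt, max_new_tokens=10, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = toy_next_token_probs(tokens)              # feed everything-so-far
        nxt = rng.choices(VOCAB, weights=probs, k=1)[0]   # pick one token
        tokens.append(nxt)                                # append it
        if nxt == "<eos>":                                # stop at end-of-sequence
            break
    return tokens

out = generate(["the"])
print(out)
```

Swapping the sampling line for an argmax gives greedy decoding; the rest of the loop stays the same.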

NLP (5): BERT and Pretrained Models

Tue, 21 Oct 2025 09:00:00 +0000

In October 2018, Google released BERT and broke eleven NLP benchmarks at once. The recipe is almost embarrassingly simple: take a Transformer encoder, train it to predict words that have been randomly hidden using both left and right context, and then fine-tune the same pretrained model for whatever downstream task you have. Before BERT, every task came with its own from-scratch model. After BERT, “pretrain once, fine-tune everywhere” became the default mental model for the entire field.

NLP (4): Attention Mechanism and Transformer

Thu, 16 Oct 2025 09:00:00 +0000

In June 2017, eight researchers at Google Brain and Google Research published a paper with a deliberately bold title: Attention Is All You Need. The architecture it introduced, the Transformer, threw away recurrence entirely. There were no LSTMs, no GRUs, no left-to-right scanning of a sentence. Instead, every token in a sequence could look at every other token directly through a single mathematical operation: scaled dot-product attention.

That one design decision unlocked massive parallelism on GPUs, eliminated the long-range dependency problems that had plagued RNNs for decades, and became the substrate on which BERT, GPT, T5, LLaMA, Claude, and essentially every modern large language model is built. If you understand this article well, the rest of the series is mostly variations on a theme.

NLP (3): RNN and Sequence Modeling

Sat, 11 Oct 2025 09:00:00 +0000

Open Google Translate, swipe-type a message, or dictate a memo to your phone — all these systems consume an ordered stream of tokens and produce another. A feed-forward network processes each input independently, but language is fundamentally sequential: the meaning of “mat” in the cat sat on the mat depends on every word that came before. Recurrent Neural Networks (RNNs) handle this by maintaining a hidden state that evolves as they process each token. The hidden state is the network’s running summary of the past — its memory.

NLP (2): Word Embeddings and Language Models

Mon, 06 Oct 2025 09:00:00 +0000

$$\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}$$

The entire trajectory of NLP shifted toward representation learning. This article walks through that shift—from the failure of one-hot vectors, to Word2Vec’s shallow networks, to the global statistics that GloVe exploits, to the subword n-grams that let FastText handle unseen words—and finally connects embeddings to the language models that gave rise to them.

NLP (1): Introduction and Text Preprocessing

Wed, 01 Oct 2025 09:00:00 +0000

Every time you ask Claude a question, autocomplete a sentence in Gmail, or read a Google Translate page, you’re using a stack that took seventy years to build. Natural Language Processing (NLP) is the field that taught machines to read, score, transform, and write human language. Surprisingly, much of the modern NLP stack still relies on preprocessing techniques from decades ago.

This first article in the series does two things. First, it maps out the field’s history, current scope, and the reasons behind the tools we use. Second, it builds the foundational layer — cleaning, tokenization, normalization, and feature extraction — with code you can use directly in a project. By the end, you’ll have a reusable preprocessing pipeline and, more importantly, an understanding of when each step is helpful and when it can destroy signal.

Reparameterization Trick & Gumbel-Softmax: A Deep Dive

Wed, 30 Jul 2025 09:00:00 +0000

The moment your model contains a sampling step, training hits a hard wall: how do gradients flow through a random node?

The reparameterization trick has a clean answer: rewrite $z\sim p_\theta(z)$ as $z=g_\theta(\epsilon)$, isolating the randomness in a parameter-free noise variable $\epsilon$, so backprop can flow through $g_\theta$. The trouble starts with discrete variables: operations like $\arg\max$ are not differentiable. Gumbel-Softmax (a.k.a. the Concrete distribution) replaces the discrete sample with a tempered softmax over Gumbel-perturbed logits, giving you a smooth, differentiable surrogate that you can train end-to-end.
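A minimal sketch of both ideas, using only the standard library. The Gaussian case stands in for $z=g_\theta(\epsilon)$; the logits, temperature, and seed are illustrative assumptions. No gradients are computed here, since the sketch only shows where the randomness lives.

```python
import math
import random

def reparam_gaussian(mu, sigma, rng):
    """z ~ N(mu, sigma^2) written as z = g_theta(eps) with eps ~ N(0, 1)."""
    eps = rng.gauss(0.0, 1.0)   # parameter-free noise
    return mu + sigma * eps     # deterministic and differentiable in mu, sigma

def sample_gumbel(rng):
    """Gumbel(0, 1) noise via the inverse CDF: -log(-log(u))."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep u strictly in (0, 1)
    return -math.log(-math.log(u))

def gumbel_softmax(logits, tau, rng):
    """Tempered softmax over Gumbel-perturbed logits (a relaxed one-hot sample)."""
    perturbed = [(l + sample_gumbel(rng)) / tau for l in logits]
    m = max(perturbed)                               # subtract max for stability
    exps = [math.exp(p - m) for p in perturbed]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
z = reparam_gaussian(mu=1.0, sigma=0.5, rng=rng)
y = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5, rng=rng)
print(z, y)
```

Lower values of `tau` push `y` toward a one-hot vector, while higher values flatten it toward uniform.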

Transfer Learning (6): Multi-Task Learning

Sat, 31 May 2025 09:00:00 +0000

A self-driving car using a single camera needs to do three things simultaneously: detect cars and pedestrians, segment lanes and free space, and estimate the distance of each pixel. Training three separate networks would triple the parameters, require three times as many forward passes at inference, and overlook the fact that all three tasks need the same low-level features (edges, surfaces, occlusion cues).

Multi-task learning (MTL) is the alternative: one shared backbone, one task-specific head per output, all trained jointly. Done well, you can cut parameters by roughly 60% and lift accuracy on every task, because each task acts as a regularizer for the others. Done badly, two of your three tasks regress and you waste a week wondering why.

Transfer Learning (5): Knowledge Distillation

Sun, 25 May 2025 09:00:00 +0000

You have a 340M-parameter BERT model that hits 95% accuracy. The product team wants it on a phone that can barely fit 10M parameters. Training a 10M model from scratch lands at 85%. Knowledge distillation closes most of the gap: train the small model on the output distribution of the large one, not just on the labels, and you can reach 92%.

The key insight, due to Hinton, is that a teacher’s “wrong” predictions are not noise — they are information. When the teacher classifies a cat image and assigns 0.14 to “tiger”, 0.07 to “dog”, and 0.008 to “plane”, it is telling you that cats look a lot like tigers, somewhat like dogs, and nothing like aeroplanes. That structure — dark knowledge — is invisible in a one-hot label, and learning it is what lets the student punch above its weight.
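As a rough sketch of how the soft targets enter the loss: soften both teacher and student logits with a temperature `T`, then penalise the divergence between them. The logits below are invented for illustration, and the `T * T` scaling follows a common formulation rather than anything stated in this excerpt.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; larger T gives a flatter distribution."""
    m = max(logits)
    exps = [math.exp((v - m) / T) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on temperature-softened outputs, scaled by T^2."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return T * T * kl

# Hypothetical logits for the classes: cat, tiger, dog, plane.
teacher = [6.0, 4.0, 3.3, 1.0]
student = [3.0, 0.5, 1.5, 1.0]

print(softmax(teacher, T=1.0))   # sharp: most mass on "cat"
print(softmax(teacher, T=4.0))   # softened: the class similarities become visible
print(distillation_loss(student, teacher))
```

In practice this term is usually mixed with the ordinary cross-entropy on the hard labels; the mixing weight is a tuning choice.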

Transfer Learning (4): Few-Shot Learning

Mon, 19 May 2025 09:00:00 +0000

Show a child one photograph of a pangolin and they will spot pangolins for life. Show a deep learning model one photograph and it will give you a uniformly random guess. Few-shot learning is the field that closes that gap: building classifiers that work with only one to ten labeled examples per class.

The trick is not to memorize individual classes harder. It is to learn how to learn from very few examples, then carry that ability over to brand-new classes at test time. This article covers the two families that dominate the field today: metric learning, which learns a good distance function, and meta-learning, which learns a good initialization.

Transfer Learning (3): Domain Adaptation

Tue, 13 May 2025 09:00:00 +0000

Your autonomous-driving stack works perfectly on sunny California freeways. Then it rains in Seattle. Top-1 accuracy drops from 95% to 70%. The model did not get worse — the data distribution shifted, and your training set never told it what wet asphalt looks like at dusk.

This is the everyday problem of domain adaptation: you have abundant labelled data in one distribution (the source) and unlabelled data in another (the target), and you need the model to perform on the target. This article shows you how, from first-principles theory to a working DANN implementation.

Transfer Learning (2): Pre-training and Fine-tuning

Wed, 07 May 2025 09:00:00 +0000

BERT changed NLP overnight. A model pre-trained on Wikipedia and BookCorpus could be fine-tuned on a few thousand labelled examples and beat task-specific architectures that researchers had spent years hand-crafting. The same pattern repeated in vision (ImageNet pre-training, then SimCLR, MAE), in speech (wav2vec 2.0), and in code (Codex). Today, “pre-train once, fine-tune everywhere” is the default recipe of modern deep learning.

But why does pre-training work? When should you freeze layers, when should you LoRA, and how small does your learning rate need to be? This article unpacks both the theory and the engineering practice behind the most successful transfer paradigm we have.

Transfer Learning (1): Fundamentals and Core Concepts

Thu, 01 May 2025 09:00:00 +0000

You spent two weeks training an ImageNet classifier on a rack of GPUs. On Monday morning, your team lead asks for a chest X-ray pneumonia model, and the entire labeled dataset is two hundred images. Do you book another two weeks of GPU time and start from scratch?

Of course not. You use what the ImageNet model already knows about edges, textures, and shapes, swap out the last layer, and fine-tune on the X-rays. Two hours later, you have a model that beats anything you could have trained from random weights with so little data. That’s transfer learning, and it’s why most real-world deep learning projects ship in days instead of months.

Essence of Linear Algebra (16): Linear Algebra in Deep Learning

Wed, 16 Apr 2025 09:00:00 +0000

Strip away the marketing and a deep network is one thing: a long pipeline of matrix multiplications glued together by elementwise nonlinearities. Forward pass, backward pass, convolution, attention, normalization, fine-tuning — every “trick” is a small twist on the same algebraic theme. Once you see the matrices, the field stops looking like a bag of recipes and starts looking like a single language.

This chapter rebuilds the modern stack from that single language. We follow one signal — a vector $\mathbf{x}$ — as it flows through linear layers, gets convolved, gets attended to, gets normalized, and gets adapted by a low-rank update. At each step we name the matrix that does the work and the property of that matrix (rank, conditioning, transpose) that makes the trick succeed.

Essence of Linear Algebra (13): Tensors and Multilinear Algebra

Wed, 26 Mar 2025 09:00:00 +0000

If you’ve used PyTorch or TensorFlow, you’ve met the word “tensor” hundreds of times. PyTorch calls every array torch.Tensor; TensorFlow puts it in the product name. But what is a tensor, and why did frameworks borrow this physics-flavored word for what looks like a multi-dimensional array?

The short answer from this chapter:

A tensor is the natural generalization of a scalar, vector, and matrix to arbitrary dimensions. Everything you know about matrices either lifts cleanly to tensors, or breaks in instructive ways.

Time Series Forecasting (8): Informer — Efficient Long-Sequence Forecasting

Sun, 15 Dec 2024 09:00:00 +0000

The Transformer is wonderful at sequence modeling — right up to the moment your sequence gets long. Vanilla self-attention costs $\mathcal{O}(L^2)$ in both compute and memory, so a one-week hourly window (168 steps) is fine, a one-month window (720 steps) is painful, and a three-month window (2160 steps) is essentially impossible on a single GPU. That is exactly the regime real-world long-horizon forecasting lives in: weather, energy, finance, IoT.

Time Series Forecasting (7): N-BEATS — Interpretable Deep Architecture

Sat, 30 Nov 2024 09:00:00 +0000

The 2018 M4 forecasting competition pooled 100,000 series across six frequencies into a single benchmark. The leaderboard was dominated by hand-tuned ensembles built from decades of statistical-forecasting craft. Then, in 2019, a pure neural network with no statistical preprocessing, no feature engineering, and no recurrence beat the competition winner on that same benchmark. That network was N-BEATS by Oreshkin et al.: a stack of fully-connected blocks with two residual paths. Its interpretable variant additionally split the forecast into a polynomial trend and a Fourier seasonality, so the very thing classical statisticians wanted (a readable decomposition) came for free.

Time Series Forecasting (6): Temporal Convolutional Networks (TCN)

Fri, 15 Nov 2024 09:00:00 +0000

For most of the 2010s, saying “deep learning for time series” meant using LSTM. The story changed in 2018 when Bai, Kolter, and Koltun published An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. Their result was surprisingly simple: use a stack of 1-D convolutions, make them causal (no peeking at the future), space the filter taps exponentially (dilation), wrap the whole thing in residual connections, and train. Task after task, the resulting Temporal Convolutional Network (TCN) matched or beat LSTM/GRU — while training several times faster because every time step in the forward pass runs in parallel.
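The two ingredients named above, causality and dilation, fit in a short sketch. This assumes one convolution per level with zero padding on the left; the function names and the toy signal are illustrative, not from the paper.

```python
def causal_dilated_conv(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k * dilation], treating x[<0] as zero."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:            # never index past t, so no peeking at the future
                acc += w * x[idx]
        y.append(acc)
    return y

def receptive_field(kernel_size, dilations):
    """Time steps visible to the last output, assuming one conv per level."""
    return 1 + (kernel_size - 1) * sum(dilations)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = causal_dilated_conv(x, weights=[1.0, 1.0], dilation=2)
print(y)                                   # each y[t] is x[t] + x[t - 2]
print(receptive_field(2, [1, 2, 4, 8]))    # dilations doubling across four levels
```

Because each output depends only on current and past inputs, every time step can be computed independently, which is where the training-time parallelism comes from.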

Time Series Forecasting (5): Transformer Architecture for Time Series

Thu, 31 Oct 2024 09:00:00 +0000

The 2017 Attention Is All You Need paper took the attention mechanism from the previous chapter to its logical extreme: drop the RNN entirely. Transformers stack pure attention into a full sequence model — no recurrence, no hidden state propagating over time. Originally designed for machine translation, the architecture was quickly adapted to every other sequence task, time series included.

Dropping a vanilla NLP Transformer onto a time-series problem runs into two immediate complications. The first is position. Attention on its own is order-blind: shuffle the input order and the outputs are simply shuffled the same way, with no other change. For a time series, order is everything: a temperature curve that goes up-then-down and one that goes down-then-up are entirely different signals. NLP solves this with sinusoidal position encodings; do those still make sense for time series, or should we use learned encodings, or just concatenate calendar features (hour-of-day, day-of-week) directly into the input?

Time Series Forecasting (4): Attention Mechanisms — Direct Long-Range Dependencies

Wed, 16 Oct 2024 09:00:00 +0000

RNNs and LSTMs handled “too many time steps” but left a subtler limitation in place: information has to travel step by step. For step 100 to see what happened at step 1, the signal has to ride the hidden state through 99 intermediate stops — and each stop attenuates the signal a little and squashes it through a nonlinearity. Even with LSTM’s “highway” cell state, it’s still a single lane in a single direction.

Time Series Forecasting (3): GRU — Lightweight Gates and Efficiency Trade-offs

Tue, 01 Oct 2024 09:00:00 +0000

After you’ve used LSTM for a while, an obvious question shows up: aren’t three gates a bit much? The forget and input gates seem to do related work — one decides what to drop, the other decides what to add — couldn’t they be merged? And does the cell state really need to be a separate vector from the hidden state, or could the hidden state do double duty?

That is exactly the question Cho et al. answered in 2014 with the Gated Recurrent Unit. They collapsed three gates into two: an update gate that controls how much of the old state to keep versus how much new content to absorb, and a reset gate that decides whether to ignore the old state entirely when computing a fresh candidate. The cell state is folded back into the hidden state. The result is roughly 25% fewer parameters, training that runs 10-15% faster, and accuracy on most time-series tasks that is statistically indistinguishable from LSTM.
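The parameter saving follows from simple counting. As a sketch, assume each gate or candidate block has one input weight matrix, one recurrent weight matrix, and one bias vector (some frameworks keep two bias vectors, which changes the totals but not the ratio). The sizes below are illustrative.

```python
def gate_block_params(input_size, hidden_size):
    """Weights for one gate or candidate: W (h x d), U (h x h), and bias (h)."""
    return hidden_size * (input_size + hidden_size) + hidden_size

def lstm_params(input_size, hidden_size):
    # Forget, input, and output gates, plus the candidate cell update.
    return 4 * gate_block_params(input_size, hidden_size)

def gru_params(input_size, hidden_size):
    # Update and reset gates, plus the candidate hidden state.
    return 3 * gate_block_params(input_size, hidden_size)

d, h = 8, 64
print(lstm_params(d, h), gru_params(d, h))
print(gru_params(d, h) / lstm_params(d, h))   # three blocks instead of four
```

The 3-to-4 ratio holds for any input and hidden size under this counting.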

Time Series Forecasting (2): LSTM — Gate Mechanisms and Long-Term Dependencies

Mon, 16 Sep 2024 09:00:00 +0000

The first RNN I ever trained, back in 2017, was a small sales forecaster: 50 days in, the next day out. The forward pass ran cleanly, the loss went down, and yet the model had near-total amnesia about anything older than three days. The data had a clear monthly cycle. The model couldn’t see it. I assumed I needed more data, so I added rows and layers — and watched the training loss jump to NaN halfway through epoch two.

Position Encoding Brief: From Sinusoidal to RoPE and ALiBi

Fri, 30 Jun 2023 09:00:00 +0000

Self-attention has a strange property that surprises most people the first time they compute it by hand: it does not know the order of its inputs. Permute the tokens and every attention score is permuted along with them — the function is exactly equivariant. So before we can do anything useful with a Transformer, we have to inject position information from the outside.

That single design decision — how to inject it — has spawned a remarkable amount of research. Sinusoidal, learned, relative, T5-style buckets, RoPE, ALiBi, NoPE, and more. This post is a practitioner’s brief: enough math to know why each scheme works, enough comparison to choose one, and a clear focus on the property that matters most in the LLM era — length extrapolation, the ability to handle sequences longer than anything seen in training.
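The equivariance property is easy to verify numerically. The sketch below uses scaled dot-product self-attention with Q = K = V = X, an assumption made to keep it short (learned projections would not change the conclusion); the input matrix and permutation are arbitrary.

```python
import math

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no projections)."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)                              # subtract max for stability
        w = [math.exp(s - m) for s in scores]
        total = sum(w)
        w = [v / total for v in w]
        out.append([sum(w[j] * X[j][c] for j in range(len(X))) for c in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]

Y = self_attention(X)
Y_perm = self_attention([X[i] for i in perm])

# Row r of the permuted output should equal row perm[r] of the original output.
max_diff = max(
    abs(Y_perm[r][c] - Y[perm[r]][c])
    for r in range(len(X))
    for c in range(len(X[0]))
)
print(max_diff < 1e-9)
```

Nothing in the computation refers to a row's index, so position has to be added to the inputs or to the scores before attention can use it.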

Variational Autoencoder (VAE): From Intuition to Implementation and Troubleshooting

Tue, 27 Jun 2023 09:00:00 +0000

A plain autoencoder compresses and reconstructs. A variational autoencoder learns something far more useful: a smooth, structured latent space you can sample from to generate genuinely new data. That single change — making the encoder output a distribution instead of a vector — turns the network from a fancy compressor into a generative model with a tractable likelihood lower bound.

This guide walks the full path: why autoencoders fail at generation, how the ELBO derivation gets you to the loss function, why the reparameterization trick is the trick that makes everything trainable, a complete PyTorch implementation, and a tour of every common failure mode with concrete fixes.

Optimization (4): Learning Rate and Schedules

Sun, 18 Sep 2022 09:00:00 +0000

Your model diverges. You halve the learning rate. Now it trains, but takes forever. You halve again — now the loss is a flat line. Sound familiar? Of all the knobs you can turn, learning rate is the one that most often decides whether training converges, crawls, or blows up. This guide gives you the intuition, the minimal math, and a practical workflow to get it right — from a 12-layer CNN on your laptop to a 70B-parameter LLM on a thousand GPUs.

Optimization (3): The Gradient Descent Family from SGD to AdamW

Fri, 16 Sep 2022 09:00:00 +0000

Why is “tuning the LR is an art” a meme for ResNet, while every modern LLM paper just writes “AdamW, $\beta_1{=}0.9, \beta_2{=}0.95, \mathrm{wd}{=}0.1$ ” and moves on? It is not an accident — it is the end-point of three decades of optimizer evolution.

This post walks the lineage end-to-end on a single thread: each step exists because of a specific failure of the previous one. We end with the three directions that have actually entered the post-2023 large-model toolkit: Lion, Sophia, and Schedule-Free.

Kernel Methods (8): Deep Kernel Learning vs Deep Learning — A Practitioner's Guide

Thu, 30 Dec 2021 09:00:00 +0000

In 2026, why are you still reading about kernel methods? Aren’t transformers supposed to have eaten the entire ML stack? Yes and no. Transformers eat the headlines, but kernels still eat the corners — the regimes with 200 samples, the regimes where the model has to publish calibrated error bars, the regimes where a physicist needs to know which basis function caused the prediction. This final part is the field manual: when kernels actually win, how to debug them when they don’t, and how to bolt them on top of a neural network when you want the best of both worlds.