Machine Learning on Chen Kai Blog

ML Math Derivations (20): Regularization and Model Selection

Sun, 08 Feb 2026 09:00:00 +0000

What You Will Learn

A 100-million-parameter network trained on 50,000 images should overfit catastrophically. Modern deep networks generalise anyway. Why? Two ingredients: regularisation (techniques that constrain capacity) and generalisation theory (mathematics that says when learning works at all). This article is the closing chapter of the series, and we use it to gather every tool we have built — least squares, MAP estimation, optimisation, EM, neural networks — and turn them on the deepest open question in the field: why does learning generalise?

ML Math Derivations (19): Neural Networks and Backpropagation

Sat, 07 Feb 2026 09:00:00 +0000

Hook. In 1969 Minsky and Papert proved that a single perceptron could not learn XOR, and connectionist research went into a fifteen-year freeze. The thaw came when Rumelhart, Hinton and Williams realised that stacking perceptrons makes the problem disappear — and that the same chain rule everyone learns in calculus, applied carefully, computes every gradient in a multilayer network for the cost of a single extra forward pass. That algorithm is backpropagation. Every gradient in every Transformer, every diffusion model, every GPT trained today still runs on it.

ML Math Derivations (18): Clustering Algorithms

Fri, 06 Feb 2026 09:00:00 +0000

What You Will Learn

A million customer records arrive with no labels. Can you discover meaningful groups automatically? That is clustering, the most fundamental unsupervised learning task. Unlike classification, clustering forces you to first answer a slippery question: what does “similar” even mean? Every clustering algorithm is, at heart, a different answer to that question — a different geometric, probabilistic, or graph-theoretic prior on what a “group” is.

ML Math Derivations (17): Dimensionality Reduction and PCA

Thu, 05 Feb 2026 09:00:00 +0000

What You Will Learn

Feed a clustering algorithm $10{,}000$-dimensional data and it will most likely fail, not because the algorithm is broken, but because high-dimensional space is a hostile environment for distance-based learning. Volumes evaporate into thin shells, the ratio of nearest- to farthest-neighbour distances tends to $1$, and “closeness” stops carrying information. Dimensionality reduction is the response: project the data into a lower-dimensional space while keeping the structure that actually matters.
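The distance-concentration claim is easy to check numerically. The sketch below is only an illustration under stated assumptions: points are drawn uniformly from the unit cube, and the function name `nearest_to_farthest_ratio`, the sample size, and the seed are all hypothetical choices, not taken from the article.

```python
import math
import random


def nearest_to_farthest_ratio(dim, n_points=200, seed=0):
    """Ratio of the nearest to the farthest distance from one query point.

    Assumption: points are uniform on the unit cube [0, 1]^dim.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return min(dists) / max(dists)


# The ratio creeps towards 1 as the dimension grows.
ratios = {dim: nearest_to_farthest_ratio(dim) for dim in (2, 10, 100, 1000)}
for dim, ratio in ratios.items():
    print(f"dim={dim:5d}  nearest/farthest = {ratio:.3f}")
```

In two dimensions the nearest neighbour is far closer than the farthest one; in a thousand dimensions the two distances are almost the same, which is why distance-based methods lose their footing there.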

ML Math Derivations (16): Conditional Random Fields

Wed, 04 Feb 2026 09:00:00 +0000

What You Will Learn

Named entity recognition, POS tagging, information extraction: every one of these tasks asks you to label each element of a sequence. HMMs (Part 15) attack this problem generatively by modelling the joint distribution $P(\mathbf{X},\mathbf{Y})$, but to make the joint factorise they pay a steep price: each observation is assumed conditionally independent of everything else given its own hidden label. In real text, whether bank is a noun or a verb depends on the preceding and following words, the suffix, capitalization, and dictionary lookups, all taken together.

ML Math Derivations (15): Hidden Markov Models

Tue, 03 Feb 2026 09:00:00 +0000

You hear footsteps behind you in the fog. You can’t see the walker; you can only hear the sounds. From the rhythm and pitch (short, soft, hurried) can you guess whether they are walking, running, or limping? And if you observed an entire sequence, which gait sequence is most likely? How likely is any sequence of sounds under your model of how walking works?

These are the three problems of HMMs, and the surprise is that all three reduce to one trick: write the joint $P(\mathbf{O}, \mathbf{I})$ as a product of local factors along time, then share sub-computations across time with dynamic programming. Brute force costs $O(N^T)$. Forward-Backward and Viterbi cost $O(N^2 T)$, and so does each iteration of Baum-Welch. The exponent collapses because the Markov assumption makes the future conditionally independent of the past given the present.
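To make the cost gap concrete, here is a minimal sketch that computes the likelihood of an observation sequence twice: once by summing the joint over every state path, and once with the forward recursion. The toy model (three gaits, two sound symbols) and all its numbers are made up for illustration, and the function names are hypothetical.

```python
from itertools import product


def forward_likelihood(pi, A, B, obs):
    """P(O) via the forward recursion: O(N^2 T) work."""
    n = len(pi)
    # alpha[j] = P(o_1..o_t, state_t = j)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
            for j in range(n)
        ]
    return sum(alpha)


def brute_force_likelihood(pi, A, B, obs):
    """P(O) by summing the joint P(O, I) over all N^T state paths."""
    n = len(pi)
    total = 0.0
    for path in product(range(n), repeat=len(obs)):
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total


# Hypothetical toy model: states = (walking, running, limping),
# observations = (soft, loud). Rows of A and B sum to 1.
pi = [0.5, 0.3, 0.2]
A = [[0.7, 0.2, 0.1],
     [0.3, 0.6, 0.1],
     [0.2, 0.1, 0.7]]
B = [[0.8, 0.2],
     [0.3, 0.7],
     [0.5, 0.5]]
obs = [0, 1, 1, 0, 1]

p_fast = forward_likelihood(pi, A, B, obs)
p_slow = brute_force_likelihood(pi, A, B, obs)
print(p_fast, p_slow)  # the two values agree up to rounding
```

With $N = 3$ and $T = 5$ the brute-force loop already visits $3^5 = 243$ paths, while the recursion does $T$ sweeps over an $N \times N$ table.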

ML Math Derivations (14): Variational Inference and Variational EM

Mon, 02 Feb 2026 09:00:00 +0000

When the posterior $p(\mathbf{z}\mid\mathbf{x})$ is intractable, you have two roads. Sampling (MCMC) walks a Markov chain whose stationary distribution is the posterior: eventually exact, but slow and hard to diagnose. Variational inference (VI) instead picks a simple family $\mathcal{Q}$ of distributions and finds the member $q^\star\in\mathcal{Q}$ that lies closest to the true posterior in KL divergence. Inference becomes optimization, and the same machinery that fits a neural network now fits a Bayesian model.

This post derives VI from a single identity, builds the mean-field algorithm and CAVI from that identity, connects EM and variational EM as special cases, and ends with the reparameterization trick that turns the ELBO into a stochastic objective compatible with autodiff — the engine inside every VAE.

ML Math Derivations (13): EM Algorithm and GMM

Sun, 01 Feb 2026 09:00:00 +0000

When data has hidden structure, such as an unobserved cluster label, a missing feature, or an unseen topic, maximum likelihood becomes challenging. The log-likelihood contains the log of a sum over the hidden values, so its maximiser has no closed form, and gradient methods get entangled with the latent variables. The EM algorithm sidesteps the difficulty with a deceptively simple idea: alternate between computing the posterior over the hidden variables under the current parameters (E-step) and refitting the parameters as if those soft guesses were true (M-step). Each iteration is mathematically guaranteed never to decrease the likelihood. This post derives EM from first principles, proves the monotone-ascent property using Jensen’s inequality, and explores its most famous application: Gaussian Mixture Models (GMM), the soft, elliptical generalization of K-means.
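The E-step / M-step alternation can be sketched in a few lines for a one-dimensional, two-component mixture. This is an illustrative sketch rather than the post's own implementation: the synthetic data, the initial parameters, the variance floor, and the function names are all assumptions made for the example.

```python
import math
import random


def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    return sum(
        math.log(sum(w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def em_step(data, weights, means, variances):
    k = len(weights)
    # E-step: posterior responsibility of each component for each point.
    resp = []
    for x in data:
        joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
        total = sum(joint)
        resp.append([p / total for p in joint])
    # M-step: responsibility-weighted maximum-likelihood updates.
    new_w, new_m, new_v = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp)
        mu = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - mu) ** 2 for r, x in zip(resp, data)) / nj
        new_w.append(nj / len(data))
        new_m.append(mu)
        new_v.append(max(var, 1e-6))  # safety floor (an assumption; never hit here)
    return new_w, new_m, new_v


# Synthetic data from two Gaussians (hypothetical ground truth: -2 and 3).
rng = random.Random(0)
data = ([rng.gauss(-2.0, 0.5) for _ in range(150)]
        + [rng.gauss(3.0, 1.0) for _ in range(150)])

weights, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
history = [log_likelihood(data, weights, means, variances)]
for _ in range(30):
    weights, means, variances = em_step(data, weights, means, variances)
    history.append(log_likelihood(data, weights, means, variances))

print("means:", means)
print("log-likelihood: first", history[0], "last", history[-1])
```

Recording the log-likelihood after every step is a cheap way to see the monotone-ascent property in action: the sequence in `history` should never go down, apart from floating-point noise.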

ML Math Derivations (12): XGBoost and LightGBM

Sat, 31 Jan 2026 09:00:00 +0000

XGBoost and LightGBM are the two libraries that quietly win most tabular-data battles: on Kaggle leaderboards, in fraud-detection pipelines, in ad ranking, in churn models. They share the same backbone (gradient-boosted trees, Part 11) but make very different engineering bets:

XGBoost sharpens the math: it brings the second derivative of the loss into the objective, regularises the tree itself, and turns split selection into a closed-form score.
LightGBM sharpens the systems: it bins features into a small histogram, grows trees leaf-by-leaf, throws away uninformative samples (GOSS) and bundles mutually exclusive sparse features (EFB).

The result is two tools that look interchangeable from the API but behave very differently when $N$ or $d$ becomes large. This post derives every formula behind those choices so you can read a tuning guide and know why each knob exists.

ML Math Derivations (11): Ensemble Learning

Fri, 30 Jan 2026 09:00:00 +0000

Why do mediocre classifiers in a committee outperform a single brilliant one? The answer is straightforward: averaging reduces variance, sequential reweighting reduces bias, and a bit of randomization breaks the correlation that would otherwise negate these benefits. This post delves into the math behind this — bias-variance decomposition, bootstrap aggregating, AdaBoost as forward stagewise minimization of exponential loss, and gradient boosting as gradient descent in function space.

By the end, you should be able to look at any ensemble method and say what it reduces, why it works, and when it fails.

ML Math Derivations (10): Semi-Naive Bayes and Bayesian Networks

Thu, 29 Jan 2026 09:00:00 +0000

Hook. Naive Bayes assumes every feature is conditionally independent given the class. It is a convenient lie — one that lets us train in a single pass over the data, but one that classifiers based on tree structures and small graphs can systematically beat by a few accuracy points on virtually every UCI benchmark. This part walks the spectrum from “no dependencies” (Naive Bayes) to “all dependencies” (full joint), showing the three sweet spots that practitioners actually use: SPODE, TAN and AODE. The same factorisation idea, taken to its general form, is the Bayesian network.

ML Math Derivations (9): Naive Bayes

Wed, 28 Jan 2026 09:00:00 +0000

Hook. A spam filter that trains in milliseconds, scales to a million features, has no hyperparameters worth tuning, and still beats much fancier models on short-text problems. Naive Bayes pulls this off by making one outrageous assumption (every feature is independent given the class) and refusing to apologise for it. The assumption is wrong on essentially every real dataset, yet the classifier works. Understanding why is a tour through generative modelling, MAP estimation, Dirichlet priors, and the bias–variance tradeoff. This article walks the entire path.

ML Math Derivations (8): Support Vector Machines

Tue, 27 Jan 2026 09:00:00 +0000

Hook. You have two clouds of points and infinitely many lines that separate them. Which line is “best”? SVM gives a startlingly geometric answer: the line that sits in the middle of the widest empty corridor between the two classes. Push that single idea through Lagrangian duality and it produces a sparse model (only the points on the corridor wall matter), a quadratic program with a global optimum, and — almost as a free gift — the kernel trick that lets the same linear machinery carve curved boundaries in infinite-dimensional spaces.

ML Math Derivations (7): Decision Trees

Mon, 26 Jan 2026 09:00:00 +0000

Hook. A decision tree mimics how humans actually decide things: ask a question, branch on the answer, ask the next question. The math under that intuition is surprisingly rich — entropy from information theory tells us which question to ask first, the Gini index gives a cheaper proxy that lands on essentially the same trees, and cost-complexity pruning gives a principled way to stop the tree from memorising noise. Almost every modern boosted ensemble (XGBoost, LightGBM, CatBoost) is just a clever sum of these objects, so getting the foundations right pays off many times over.

ML Math Derivations (6): Logistic Regression and Classification

Sun, 25 Jan 2026 09:00:00 +0000

Hook. Linear regression maps inputs to any real number — but what if the output has to be a probability between 0 and 1? Logistic regression solves this with one elegant trick: a sigmoid squashing function. Despite its name, logistic regression is a classification algorithm, and its math underpins every neuron in every modern neural network.

What You Will Learn

Why sigmoid is the natural way to turn a real-valued score into a probability, and why its derivative is so clean.
How cross-entropy loss falls out of maximum likelihood estimation in two lines.
Why cross-entropy beats MSE for classification — a vanishing-gradient argument made visible.
The full gradient and Hessian for both binary and multi-class (softmax) cases, and why the loss is convex.
L1, L2 and elastic-net regularization, and the Bayesian priors hiding behind them.
Decision-boundary geometry and the threshold-free metrics (ROC / PR / AUC) that you actually need under class imbalance.
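Two of the items above can be checked numerically before reading the derivation: the sigmoid derivative $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$, and the cross-entropy gradient $\frac{1}{n}\sum_i (\sigma(\mathbf{w}^\top \mathbf{x}_i) - y_i)\,\mathbf{x}_i$. The sketch below compares both against finite differences. The tiny dataset, the weights, and the function names are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(w, xs, ys):
    """Mean binary cross-entropy, i.e. the negative mean Bernoulli log-likelihood."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)


def bce_grad(w, xs, ys):
    """Analytic gradient: the mean of (sigmoid(w.x) - y) * x."""
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for k, xk in enumerate(x):
            grad[k] += (p - y) * xk / len(xs)
    return grad


def numeric_grad(f, w, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    grad = []
    for k in range(len(w)):
        up = w[:k] + [w[k] + eps] + w[k + 1:]
        down = w[:k] + [w[k] - eps] + w[k + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


# Hypothetical toy data; the first feature is a constant bias term.
xs = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, -0.3]]
ys = [1, 0, 1, 0]
w = [0.1, -0.4]

analytic = bce_grad(w, xs, ys)
numeric = numeric_grad(lambda v: bce_loss(v, xs, ys), w)
print("analytic:", analytic)
print("numeric: ", numeric)

# The "clean" sigmoid derivative, checked at one arbitrary point.
z = 0.7
sigmoid_slope = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6
print(sigmoid_slope, sigmoid(z) * (1 - sigmoid(z)))
```

The residual `sigmoid(w.x) - y` appears in the gradient with no extra sigmoid-derivative factor, which is the cancellation behind the cross-entropy versus MSE comparison in the list above.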

Prerequisites

Calculus: chain rule, partial derivatives.
Linear algebra: matrix multiplication, transpose.
Probability: Bernoulli and categorical distributions, likelihood.
Familiarity with Part 5: Linear Regression.

From Linear Models to Probabilistic Classification

The Problem with Raw Linear Output

Linear regression gives us $\hat y = \mathbf{w}^\top \mathbf{x}$, which is unbounded. For classification, two things go wrong:

ML Math Derivations (5): Linear Regression

Sat, 24 Jan 2026 09:00:00 +0000

Hook. In 1886 Francis Galton noticed something strange about heredity: children of unusually tall (or short) parents tended to be closer to the average than their parents were. He called this drift toward the mean regression, and the name stuck. The statistical curiosity grew up into the most consequential model in machine learning — not because linear regression is powerful on its own, but because almost every other algorithm (logistic regression, neural networks, kernel methods) is some twist on the same idea: fit a line, but in the right space.

ML Math Derivations (4): Convex Optimization Theory

Fri, 23 Jan 2026 09:00:00 +0000

What You Will Learn

In 1947, George Dantzig proposed the simplex method for linear programming, and a working theory of optimization was born. Eight decades later, optimization has become the engine of machine learning: every model you train, from a one-line linear regression to a 70B-parameter language model, is the answer to some optimization problem.

ML Math Derivations (3): Probability Theory and Statistical Inference

Thu, 22 Jan 2026 09:00:00 +0000

What You Will Learn

In 1912, Ronald Fisher introduced maximum likelihood estimation in a short paper that quietly redefined statistics. His insight was almost embarrassingly simple: if a parameter setting makes the observed data extremely likely, it is probably correct. Almost every modern learning algorithm — from logistic regression to large language models — descends from this idea.

ML Math Derivations (2): Linear Algebra and Matrix Theory

Wed, 21 Jan 2026 09:00:00 +0000

Why this chapter, and what’s different

If you have already worked through a standard linear-algebra course you have seen most of these objects. This chapter is not that course. It is the ML practitioner’s slice of linear algebra: the half-dozen ideas that actually appear when you implement gradient descent, run PCA, train a neural net, or read a paper.

ML Math Derivations (1): Introduction and Mathematical Foundations

Tue, 20 Jan 2026 09:00:00 +0000

What this chapter does

In 2005 Google Research showed, on a public benchmark, that a statistical translation model trained on raw bilingual text could outperform decades of carefully engineered linguistic rules. The conclusion was uncomfortable for the experts of the day, but mathematically liberating: a system that has never been told the rules of a language can still recover them, given enough examples. Why?

Transfer Learning (10): Continual Learning

Tue, 24 Jun 2025 09:00:00 +0000

You can teach yourself to play guitar this year and you will still remember how to ride a bike. A neural network cannot. Fine-tune a vision model on CIFAR-10 then on SVHN, evaluate it on CIFAR-10 again, and accuracy collapses to barely above chance. The phenomenon is called catastrophic forgetting, and overcoming it is the central problem of continual learning (CL): a learner that absorbs a stream of tasks $\mathcal{T}_1, \mathcal{T}_2, \ldots$ without re-accessing past data and without losing what it already knew.

Transfer Learning (9): Parameter-Efficient Fine-Tuning

Wed, 18 Jun 2025 09:00:00 +0000

How do you fine-tune a 175B-parameter model on a single GPU? Update only 0.1% of the parameters. Parameter-Efficient Fine-Tuning (PEFT) makes this possible — and on most benchmarks it matches full fine-tuning. This post derives the math behind LoRA, Adapter, Prefix-Tuning, Prompt-Tuning, BitFit and QLoRA, and gives you a single picture for choosing among them.

What You Will Learn

Why the low-rank assumption holds for weight updates
LoRA: derivation, initialization, scaling, and weight merging
Adapter: bottleneck architecture and where to insert it
Prefix-Tuning vs Prompt-Tuning vs P-Tuning v2
QLoRA: how 4-bit quantisation gets a 65B model on one GPU
Method comparison and a selection guide grounded in GLUE numbers
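As a preview of the LoRA items in this list, here is a minimal sketch of the low-rank update $W = W_0 + \frac{\alpha}{r} B A$ with the usual initialisation ($B = 0$, $A$ small random), followed by weight merging. The tiny dimensions, the random stand-in for a "trained" $B$, and every function name are illustrative assumptions, not the post's code.

```python
import random


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(X, Y):
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]


def lora_forward(W0, A, B, alpha, x):
    """Frozen base path plus scaled low-rank path: W0 x + (alpha / r) * B (A x)."""
    r = len(A)
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def merge_weights(W0, A, B, alpha):
    """Fold the adapter into a single matrix so inference needs no extra matmul."""
    r = len(A)
    BA = matmul(B, A)
    return [[w + (alpha / r) * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W0, BA)]


def lora_param_count(d_out, d_in, r):
    return r * (d_out + d_in)  # entries of B (d_out x r) plus entries of A (r x d_in)


rng = random.Random(0)
d_out, d_in, r, alpha = 6, 5, 2, 4.0
W0 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # zero init: the adapter starts as a no-op
x = [rng.gauss(0, 1) for _ in range(d_in)]

y_base = matvec(W0, x)
y_init = lora_forward(W0, A, B, alpha, x)

# Stand-in for training: give B arbitrary non-zero values.
B = [[rng.gauss(0, 0.1) for _ in range(r)] for _ in range(d_out)]
y_adapter = lora_forward(W0, A, B, alpha, x)
y_merged = matvec(merge_weights(W0, A, B, alpha), x)

full = 4096 * 4096
low_rank = lora_param_count(4096, 4096, 8)
print(f"trainable fraction for one 4096x4096 layer at r=8: {low_rank / full:.4%}")
```

For a single 4096 by 4096 projection at rank 8, the adapter trains 65,536 numbers instead of 16,777,216, and after merging the layer costs exactly what it did before.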

Prerequisites

Transformer architecture (attention, FFN, residual + LayerNorm)
Matrix decomposition basics (rank, SVD)
Transfer learning fundamentals (Parts 1-6)

The Full Fine-Tuning Problem

$$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})$$

For GPT-3 (175B params) this means roughly 700 GB of FP32 weights, plus gradients, plus optimiser states — and one full copy per task. Even after the model fits, the per-task storage and serving cost is brutal: 100 customers means 100 copies of a 700 GB checkpoint.
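The arithmetic behind those figures is short enough to spell out. The weight and per-customer numbers follow directly from the text; the training-time total is a rough sketch that assumes an Adam-style optimiser keeping two FP32 moment buffers per parameter, which is an assumption of this example rather than something the post states.

```python
PARAMS = 175_000_000_000   # GPT-3 parameter count
BYTES_FP32 = 4             # bytes per 32-bit float
GB = 10 ** 9               # decimal gigabyte

weights_gb = PARAMS * BYTES_FP32 // GB          # one FP32 copy of the weights
grads_gb = weights_gb                           # one gradient per weight
# Assumption: Adam-style optimiser with two FP32 moments per parameter.
optimizer_gb = 2 * weights_gb
training_gb = weights_gb + grads_gb + optimizer_gb

customers = 100
fleet_tb = customers * weights_gb // 1000       # one full checkpoint per customer

print(f"weights:            {weights_gb} GB")
print(f"training footprint: {training_gb} GB (weights + grads + optimiser states)")
print(f"{customers} customers:      {fleet_tb} TB of checkpoints")
```

Activations and any mixed-precision copies are left out on purpose, so the real training footprint is larger than this lower bound.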