<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>ML Math Derivations on Chen Kai Blog</title><link>https://www.chenk.top/en/ml-math-derivations/</link><description>Recent content in ML Math Derivations on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Sun, 08 Feb 2026 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/ml-math-derivations/index.xml" rel="self" type="application/rss+xml"/><item><title>ML Math Derivations (20): Regularization and Model Selection</title><link>https://www.chenk.top/en/ml-math-derivations/20-regularization-and-model-selection/</link><pubDate>Sun, 08 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/20-regularization-and-model-selection/</guid><description>&lt;h2 id="what-this-article-covers">What This Article Covers&lt;/h2>
&lt;p>A 100-million-parameter network trained on 50,000 images &lt;em>should&lt;/em> overfit catastrophically. Modern deep networks generalise anyway. &lt;strong>Why?&lt;/strong> Two ingredients: &lt;em>regularisation&lt;/em> (techniques that constrain capacity) and &lt;em>generalisation theory&lt;/em> (mathematics that says when learning works at all). This article is the closing chapter of the series, and we use it to gather every tool we have built — least squares, MAP estimation, optimisation, EM, neural networks — and turn them on the deepest open question in the field: &lt;em>why does learning generalise?&lt;/em>&lt;/p></description></item><item><title>ML Math Derivations (19): Neural Networks and Backpropagation</title><link>https://www.chenk.top/en/ml-math-derivations/19-neural-networks-and-backpropagation/</link><pubDate>Sat, 07 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/19-neural-networks-and-backpropagation/</guid><description>&lt;h2 id="what-this-article-covers">What This Article Covers&lt;/h2>
&lt;p>A single perceptron cannot solve XOR. Stack enough of them with nonlinear activations and you obtain a &lt;em>universal function approximator&lt;/em>. The remaining question is how such a network learns from data. The answer — &lt;strong>backpropagation&lt;/strong>, an efficient application of the chain rule that recycles intermediate results during a single backward sweep — is the engine behind every deep learning library written in the last forty years. Understanding it mathematically reveals two further truths: why deep networks suffer from vanishing or exploding gradients, and why the choice of weight initialization is much less arbitrary than it first appears.&lt;/p></description></item><item><title>ML Math Derivations (18): Clustering Algorithms</title><link>https://www.chenk.top/en/ml-math-derivations/18-clustering-algorithms/</link><pubDate>Fri, 06 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/18-clustering-algorithms/</guid><description>&lt;h2 id="what-this-article-covers">What This Article Covers&lt;/h2>
&lt;p>A million customer records arrive with no labels. Can you discover meaningful groups automatically? That is &lt;strong>clustering&lt;/strong>, the most fundamental unsupervised learning task. Unlike classification, clustering forces you to first answer a slippery question: &lt;em>what does &amp;ldquo;similar&amp;rdquo; even mean?&lt;/em> Every clustering algorithm is, at heart, a different answer to that question &amp;ndash; a different geometric, probabilistic, or graph-theoretic prior on what a &amp;ldquo;group&amp;rdquo; is.&lt;/p>
&lt;p>&lt;strong>What you will learn:&lt;/strong>&lt;/p></description></item><item><title>ML Math Derivations (17): Dimensionality Reduction and PCA</title><link>https://www.chenk.top/en/ml-math-derivations/17-dimensionality-reduction-and-pca/</link><pubDate>Thu, 05 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/17-dimensionality-reduction-and-pca/</guid><description>&lt;h2 id="what-this-article-covers">What This Article Covers&lt;/h2>
&lt;p>Feed a clustering algorithm $10{,}000$-dimensional data and it will most likely fail &amp;ndash; not because the algorithm is broken, but because &lt;strong>high-dimensional space is a hostile environment for distance-based learning&lt;/strong>. Volumes evaporate into thin shells, the ratio of nearest- to farthest-neighbour distances tends to $1$, and &amp;ldquo;closeness&amp;rdquo; stops carrying information. Dimensionality reduction is the response: project the data into a lower-dimensional space while keeping the structure that actually matters.&lt;/p></description></item><item><title>ML Math Derivations (16): Conditional Random Fields</title><link>https://www.chenk.top/en/ml-math-derivations/16-conditional-random-fields/</link><pubDate>Wed, 04 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/16-conditional-random-fields/</guid><description>&lt;h2 id="what-this-article-covers">What This Article Covers&lt;/h2>
&lt;p>Named entity recognition, POS tagging, information extraction &amp;ndash; every one of these tasks asks you to label each element of a sequence. HMMs (&lt;a href="https://www.chenk.top/en/Machine-Learning-Mathematical-Derivations-15-Hidden-Markov-Models/">Part 15&lt;/a>
) attack this problem &lt;strong>generatively&lt;/strong> by modelling the joint distribution $P(\mathbf{X},\mathbf{Y})$, but to make the joint factorise they pay a steep price: each observation is assumed independent of everything except its own hidden label. In real text, whether &lt;em>bank&lt;/em> is a noun or a verb depends on the preceding word, the following word, the suffix, capitalisation, dictionary lookups &amp;ndash; all of these features at once.&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (15): Hidden Markov Models</title><link>https://www.chenk.top/en/ml-math-derivations/15-hidden-markov-models/</link><pubDate>Tue, 03 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/15-hidden-markov-models/</guid><description>&lt;p>You hear footsteps behind you in a fog. You cannot see the walker, only the sounds. From the rhythm and pitch &amp;ndash; short, soft, hurried &amp;ndash; can you guess whether they are walking, running, or limping? And if you observed an entire sequence, which gait sequence is most likely? How likely is &lt;em>any&lt;/em> sequence of sounds under your model of how walking works?&lt;/p>
&lt;p>These are the &lt;strong>three problems of HMMs&lt;/strong>, and the surprise is that all three reduce to one trick: write the joint $P(\mathbf{O}, \mathbf{I})$ as a product of local factors along time, then &lt;strong>share sub-computations across time&lt;/strong> with dynamic programming. Brute force costs $O(N^T)$. Forward-Backward, Viterbi, and Baum-Welch all cost $O(N^2 T)$. The exponent collapses because the Markov assumption makes the future conditionally independent of the past given the present.&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (14): Variational Inference and Variational EM</title><link>https://www.chenk.top/en/ml-math-derivations/14-variational-inference-and-variational-em/</link><pubDate>Mon, 02 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/14-variational-inference-and-variational-em/</guid><description>&lt;p>When the posterior $p(\mathbf{z}\mid\mathbf{x})$ is intractable, you have two roads. &lt;strong>Sampling&lt;/strong> (MCMC) walks a Markov chain whose stationary distribution is the posterior — eventually exact, but slow and hard to diagnose. &lt;strong>Variational inference&lt;/strong> (VI) instead picks a simple family $\mathcal{Q}$ of distributions and finds the member $q^\star\in\mathcal{Q}$ that lies closest to the true posterior. 
Inference becomes optimization, and the same machinery that fits a neural network now fits a Bayesian model.&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (13): EM Algorithm and GMM</title><link>https://www.chenk.top/en/ml-math-derivations/13-em-algorithm-and-gmm/</link><pubDate>Sun, 01 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/13-em-algorithm-and-gmm/</guid><description>&lt;p>When data carries hidden structure &amp;ndash; a cluster label you never observed, a missing feature, a topic you cannot directly see &amp;ndash; maximum likelihood becomes painful. The log of a sum has no closed form, and gradient methods get tangled in the latent variables. The &lt;strong>EM algorithm&lt;/strong> sidesteps the difficulty with a deceptively simple idea: alternate between &lt;em>guessing&lt;/em> the hidden variables under a posterior (E-step) and &lt;em>fitting&lt;/em> the parameters as if those guesses were true (M-step). Each iteration is mathematically guaranteed to push the likelihood up. This post derives EM from first principles, proves the monotone-ascent property via Jensen&amp;rsquo;s inequality, and works through its most famous application: &lt;strong>Gaussian Mixture Models (GMM)&lt;/strong> &amp;ndash; the soft, elliptical generalisation of K-means.&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (12): XGBoost and LightGBM</title><link>https://www.chenk.top/en/ml-math-derivations/12-xgboost-and-lightgbm/</link><pubDate>Sat, 31 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/12-xgboost-and-lightgbm/</guid><description>&lt;p>XGBoost and LightGBM are the two libraries that quietly win most tabular-data battles &amp;mdash; on Kaggle leaderboards, in fraud-detection pipelines, in ad ranking, in churn models. They share the same backbone (gradient-boosted trees, Part 11) but make very different engineering bets:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>XGBoost&lt;/strong> sharpens the &lt;em>math&lt;/em>: it brings the second derivative of the loss into the objective, regularises the tree itself, and turns split selection into a closed-form score.&lt;/li>
&lt;li>&lt;strong>LightGBM&lt;/strong> sharpens the &lt;em>systems&lt;/em>: it bins features into small histograms, grows trees leaf-by-leaf, down-samples low-gradient examples (GOSS), and bundles mutually exclusive sparse features (EFB).&lt;/li>
&lt;/ul>
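&lt;p>&lt;em>A toy illustration (a sketch of the textbook formulas, not the library internals):&lt;/em> for a fixed leaf with gradient sum $G$, Hessian sum $H$ and L2 penalty $\lambda$, XGBoost&amp;rsquo;s second-order objective yields the closed-form optimal weight $w^\star = -G/(H+\lambda)$, and the split score is built from the same quantity.&lt;/p>

```python
import numpy as np

def leaf_weight(G, H, lam):
    # Closed-form minimiser of the second-order leaf objective
    #   G * w + 0.5 * (H + lam) * w**2,  giving  w* = -G / (H + lam)
    return -G / (H + lam)

# Squared-error sanity check: g_i = pred_i - y_i and h_i = 1, so with
# zero predictions the optimal leaf weight is the mean of the targets.
y = np.array([1.0, 2.0, 3.0])
pred = np.zeros_like(y)
G, H = float(np.sum(pred - y)), float(len(y))
print(leaf_weight(G, H, lam=0.0))  # prints 2.0, the mean of y
```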
&lt;p>The result is two tools that look interchangeable at the API level but behave very differently when $N$ or $d$ becomes large. This post derives every formula behind those choices so you can read a tuning guide and know &lt;em>why&lt;/em> each knob exists.&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (11): Ensemble Learning</title><link>https://www.chenk.top/en/ml-math-derivations/11-ensemble-learning/</link><pubDate>Fri, 30 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/11-ensemble-learning/</guid><description>&lt;p>Why does a committee of mediocre classifiers outperform a single brilliant one? The answer is unromantic but precise: averaging cuts variance, sequential reweighting cuts bias, and a little randomisation breaks the correlation that would otherwise destroy both effects. This post derives the mathematics behind that picture &amp;mdash; bias&amp;ndash;variance decomposition, bootstrap aggregating, AdaBoost as forward stagewise minimisation of exponential loss, and gradient boosting as gradient descent in function space.&lt;/p>
&lt;p>By the end you should be able to look at any ensemble method and say &lt;em>what it is reducing, why it works, and when it will fail.&lt;/em>&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (10): Semi-Naive Bayes and Bayesian Networks</title><link>https://www.chenk.top/en/ml-math-derivations/10-semi-naive-bayes-and-bayesian-networks/</link><pubDate>Thu, 29 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/10-semi-naive-bayes-and-bayesian-networks/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook.&lt;/strong> Naive Bayes assumes every feature is conditionally independent given the class. It is a convenient lie &amp;ndash; it lets us train in a single pass over the data, yet classifiers that model tree-structured or small-graph dependencies systematically beat it by a few accuracy points on virtually every UCI benchmark. This part walks the spectrum from &amp;ldquo;no dependencies&amp;rdquo; (Naive Bayes) to &amp;ldquo;all dependencies&amp;rdquo; (full joint), showing the three sweet spots that practitioners actually use: SPODE, TAN and AODE. The same factorisation idea, taken to its general form, is the Bayesian network.&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (9): Naive Bayes</title><link>https://www.chenk.top/en/ml-math-derivations/09-naive-bayes/</link><pubDate>Wed, 28 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/09-naive-bayes/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook:&lt;/strong> A spam filter that trains in milliseconds, scales to a million features, has &lt;em>no hyperparameters worth tuning&lt;/em>, and still beats much fancier models on short-text problems. Naive Bayes pulls this off by making one outrageous assumption — every feature is independent given the class — and refusing to apologise for it. The assumption is wrong on essentially every real dataset, yet the classifier works. Understanding &lt;em>why&lt;/em> is a tour through generative modelling, MAP estimation, Dirichlet priors, and the bias–variance tradeoff. This article walks the entire path.&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (8): Support Vector Machines</title><link>https://www.chenk.top/en/ml-math-derivations/08-support-vector-machines/</link><pubDate>Tue, 27 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/08-support-vector-machines/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook.&lt;/strong> You have two clouds of points and infinitely many lines that separate them. Which line is &amp;ldquo;best&amp;rdquo;? SVM gives a startlingly geometric answer: the line that sits in the middle of the &lt;em>widest empty corridor&lt;/em> between the two classes. Push that single idea through Lagrangian duality and it produces a sparse model (only the points on the corridor wall matter), a quadratic program with a global optimum, and &amp;ndash; almost as a free gift &amp;ndash; the kernel trick that lets the same linear machinery carve curved boundaries in infinite-dimensional spaces.&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (7): Decision Trees</title><link>https://www.chenk.top/en/ml-math-derivations/07-decision-trees/</link><pubDate>Mon, 26 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/07-decision-trees/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook.&lt;/strong> A decision tree mimics how humans actually decide things: ask a question, branch on the answer, ask the next question. The math under that intuition is surprisingly rich — entropy from information theory tells us &lt;em>which&lt;/em> question to ask first, the Gini index gives a cheaper proxy that lands on essentially the same trees, and cost-complexity pruning gives a principled way to stop the tree from memorising noise. Almost every modern boosted ensemble (XGBoost, LightGBM, CatBoost) is just a clever sum of these objects, so getting the foundations right pays off many times over.&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (6): Logistic Regression and Classification</title><link>https://www.chenk.top/en/ml-math-derivations/06-logistic-regression-and-classification/</link><pubDate>Sun, 25 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/06-logistic-regression-and-classification/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook.&lt;/strong> Linear regression maps inputs to any real number — but what if the output has to be a probability between 0 and 1? Logistic regression solves this with one elegant trick: a sigmoid squashing function. Despite its name, logistic regression is a &lt;em>classification&lt;/em> algorithm, and its math underpins every neuron in every modern neural network.&lt;/p>
&lt;/blockquote>
&lt;h2 id="what-you-will-learn">What You Will Learn&lt;/h2>
&lt;ul>
&lt;li>Why sigmoid is the natural way to turn a real-valued score into a probability, and why its derivative is so clean.&lt;/li>
&lt;li>How cross-entropy loss falls out of maximum likelihood estimation in two lines.&lt;/li>
&lt;li>Why cross-entropy beats MSE for classification — a vanishing-gradient argument made visible.&lt;/li>
&lt;li>The full gradient and Hessian for both binary and multi-class (softmax) cases, and why the loss is convex.&lt;/li>
&lt;li>L1, L2 and elastic-net regularization, and the Bayesian priors hiding behind them.&lt;/li>
&lt;li>Decision-boundary geometry and the threshold-free metrics (ROC / PR / AUC) that you actually need under class imbalance.&lt;/li>
&lt;/ul>
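&lt;p>&lt;em>A quick numerical check (an illustrative sketch, not code from the post):&lt;/em> the &amp;ldquo;clean&amp;rdquo; derivative in the first bullet is the identity $\sigma'(z)=\sigma(z)\bigl(1-\sigma(z)\bigr)$, which a few lines of NumPy can confirm against a finite difference.&lt;/p>

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real-valued score into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4.0, 4.0, 9)
s = sigmoid(z)

# Analytic derivative via sigma'(z) = sigma(z) * (1 - sigma(z)),
# checked against a central finite difference.
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)
analytic = s * (1.0 - s)
print(np.allclose(numeric, analytic, atol=1e-8))  # True
```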
&lt;h2 id="prerequisites">Prerequisites&lt;/h2>
&lt;ul>
&lt;li>Calculus: chain rule, partial derivatives.&lt;/li>
&lt;li>Linear algebra: matrix multiplication, transpose.&lt;/li>
&lt;li>Probability: Bernoulli and categorical distributions, likelihood.&lt;/li>
&lt;li>Familiarity with &lt;a href="https://www.chenk.top/en/Machine-Learning-Mathematical-Derivations-5-Linear-Regression/">Part 5: Linear Regression&lt;/a>
.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="1-from-linear-models-to-probabilistic-classification">1. From Linear Models to Probabilistic Classification&lt;/h2>
&lt;h3 id="11-the-problem-with-raw-linear-output">1.1 The Problem with Raw Linear Output&lt;/h3>
&lt;p>Linear regression gives us $\hat y = \mathbf{w}^\top \mathbf{x}$, which is unbounded. For classification, two things go wrong:&lt;/p></description></item><item><title>Mathematical Derivation of Machine Learning (5): Linear Regression</title><link>https://www.chenk.top/en/ml-math-derivations/05-linear-regression/</link><pubDate>Sat, 24 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/05-linear-regression/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook.&lt;/strong> In 1886 Francis Galton noticed something strange about heredity: children of unusually tall (or short) parents tended to be closer to the average than their parents were. He called this drift toward the mean &lt;em>regression&lt;/em>, and the name stuck. The statistical curiosity grew up into the most consequential model in machine learning &amp;ndash; not because linear regression is powerful on its own, but because almost every other algorithm (logistic regression, neural networks, kernel methods) is some twist on the same idea: &lt;strong>fit a line, but in the right space.&lt;/strong>&lt;/p></description></item><item><title>ML Math Derivations (4): Convex Optimization Theory</title><link>https://www.chenk.top/en/ml-math-derivations/04-convex-optimization-theory/</link><pubDate>Fri, 23 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/04-convex-optimization-theory/</guid><description>&lt;h2 id="what-this-article-covers">What This Article Covers&lt;/h2>
&lt;p>In 1947, George Dantzig proposed the simplex method for linear programming, and a working theory of optimization was born. Eight decades later, optimization has become the engine of machine learning: every model you train, from a one-line linear regression to a 70B-parameter language model, is the answer to &lt;em>some&lt;/em> optimization problem.&lt;/p>
&lt;p>Among all such problems, &lt;strong>convex optimization holds a privileged place&lt;/strong>. The defining property is so strong it almost feels like cheating: every local minimum is automatically a global minimum, and a handful of well-understood algorithms come with airtight convergence guarantees. The whole reason we treat &amp;ldquo;convex&amp;rdquo; as a green flag and &amp;ldquo;non-convex&amp;rdquo; as a yellow one comes down to this single fact.&lt;/p></description></item><item><title>ML Math Derivations (3): Probability Theory and Statistical Inference</title><link>https://www.chenk.top/en/ml-math-derivations/03-probability-theory-and-statistical-inference/</link><pubDate>Thu, 22 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/03-probability-theory-and-statistical-inference/</guid><description>&lt;h2 id="what-this-article-covers">What This Article Covers&lt;/h2>
&lt;p>In 1912, Ronald Fisher introduced &lt;strong>maximum likelihood estimation&lt;/strong> in a short paper that quietly redefined statistics. His insight was almost embarrassingly simple: &lt;em>if a parameter setting makes the observed data extremely likely, that parameter setting is probably right&lt;/em>. Almost every modern learning algorithm — from logistic regression to large language models — is a descendant of this idea.&lt;/p>
&lt;p>But likelihood alone is not enough. To use it we need a vocabulary for uncertainty (probability spaces, distributions), guarantees that empirical quantities track population ones (laws of large numbers, central limit theorem), and tools for incorporating prior knowledge (Bayesian inference). This article assembles those pieces into a coherent foundation for everything that follows.&lt;/p></description></item><item><title>ML Math Derivations (2): Linear Algebra and Matrix Theory</title><link>https://www.chenk.top/en/ml-math-derivations/02-linear-algebra-and-matrix-theory/</link><pubDate>Wed, 21 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/02-linear-algebra-and-matrix-theory/</guid><description>&lt;h2 id="why-this-chapter-and-whats-different">Why this chapter, and what&amp;rsquo;s different&lt;/h2>
&lt;p>If you have already worked through a standard linear-algebra course you have seen most of these objects. &lt;strong>This chapter is not that course.&lt;/strong> It is the &lt;em>ML practitioner&amp;rsquo;s slice&lt;/em> of linear algebra: the half-dozen ideas that actually appear when you implement gradient descent, run PCA, train a neural net, or read a paper.&lt;/p>
&lt;p>Concretely the goals are:&lt;/p>
&lt;ol>
&lt;li>Build a &lt;strong>geometric intuition&lt;/strong> for what matrices &lt;em>do&lt;/em> (rotate, stretch, project, kill).&lt;/li>
&lt;li>Learn the four decompositions that show up everywhere &amp;ndash; spectral, &lt;strong>SVD&lt;/strong>, QR, Cholesky &amp;ndash; and &lt;em>which one to reach for&lt;/em>.&lt;/li>
&lt;li>Master enough &lt;strong>matrix calculus&lt;/strong> to derive any neural-net gradient on the back of an envelope.&lt;/li>
&lt;/ol>
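&lt;p>&lt;em>A taste of that &amp;ldquo;line of NumPy&amp;rdquo; style (an illustrative sketch, not code from the chapter):&lt;/em> the SVD in goal 2 is a single library call, and the factors multiply straight back to the original matrix.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

# Thin SVD: A = U @ diag(s) @ Vt, singular values sorted descending
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The decomposition is exact: multiplying the factors reconstructs A
A_hat = U @ np.diag(s) @ Vt
print(np.allclose(A, A_hat))  # True
```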
&lt;p>We skim the algebra of row reduction, determinants by cofactor, and abstract vector-space proofs. If you need those, the references at the bottom give the standard treatments. Here, every concept comes back to a picture or a line of NumPy.&lt;/p></description></item><item><title>ML Math Derivations (1): Introduction and Mathematical Foundations</title><link>https://www.chenk.top/en/ml-math-derivations/01-introduction-and-mathematical-foundations/</link><pubDate>Tue, 20 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/01-introduction-and-mathematical-foundations/</guid><description>&lt;h2 id="what-this-chapter-does">What this chapter does&lt;/h2>
&lt;p>In 2005 Google Research showed, on a public benchmark, that a statistical translation model trained on raw bilingual text could outperform decades of carefully engineered linguistic rules. The conclusion was uncomfortable for the experts of the day, but mathematically liberating: &lt;strong>a system that has never been told the rules of a language can still recover them, given enough examples.&lt;/strong> Why?&lt;/p>
&lt;p>The answer is not a trick of engineering &amp;ndash; it is a theorem. In this chapter we build, from first principles, the part of mathematics that explains &lt;em>when&lt;/em> learning from data is possible, &lt;em>how much data&lt;/em> is required, and &lt;em>what fundamentally limits&lt;/em> what any algorithm can do.&lt;/p></description></item></channel></rss>