<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>ML Math Derivations on Chen Kai Blog</title><link>https://www.chenk.top/en/ml-math-derivations/</link><description>Recent content in ML Math Derivations on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Sun, 08 Feb 2026 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/ml-math-derivations/index.xml" rel="self" type="application/rss+xml"/><item><title>ML Math Derivations (20): Regularization and Model Selection</title><link>https://www.chenk.top/en/ml-math-derivations/20-regularization-and-model-selection/</link><pubDate>Sun, 08 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/20-regularization-and-model-selection/</guid><description>&lt;h2 id="what-this-article-covers">What This Article Covers&lt;/h2>
&lt;p>A 100-million-parameter network trained on 50,000 images &lt;em>should&lt;/em> overfit catastrophically. Modern deep networks generalise anyway. &lt;strong>Why?&lt;/strong> Two ingredients: &lt;em>regularisation&lt;/em> (techniques that constrain capacity) and &lt;em>generalisation theory&lt;/em> (mathematics that says when learning works at all). This article is the closing chapter of the series, and we use it to gather every tool we have built — least squares, MAP estimation, optimisation, EM, neural networks — and turn them on the deepest open question in the field: &lt;em>why does learning generalise?&lt;/em>&lt;/p></description></item><item><title>ML Math Derivations (19): Neural Networks and Backpropagation</title><link>https://www.chenk.top/en/ml-math-derivations/19-neural-networks-and-backpropagation/</link><pubDate>Sat, 07 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/19-neural-networks-and-backpropagation/</guid><description>&lt;h2 id="what-this-article-covers">What This Article Covers&lt;/h2>
&lt;p>A single perceptron cannot solve XOR. Stack enough of them with nonlinear activations and you obtain a &lt;em>universal function approximator&lt;/em>. The remaining question is how such a network learns from data. The answer — &lt;strong>backpropagation&lt;/strong>, an efficient application of the chain rule that recycles intermediate results during a single backward sweep — is the engine behind every deep learning library written in the last forty years. Understanding it mathematically reveals two further truths: why deep networks suffer from vanishing or exploding gradients, and why the choice of weight initialization is much less arbitrary than it first appears.&lt;/p></description></item><item><title>ML Math Derivations (18): Clustering Algorithms</title><link>https://www.chenk.top/en/ml-math-derivations/18-clustering-algorithms/</link><pubDate>Fri, 06 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/18-clustering-algorithms/</guid><description>&lt;h2 id="what-this-article-covers">What This Article Covers&lt;/h2>
&lt;p>A million customer records arrive with no labels. Can you discover meaningful groups automatically? That is &lt;strong>clustering&lt;/strong>, the most fundamental unsupervised learning task. Unlike classification, clustering forces you to first answer a slippery question: &lt;em>what does &amp;ldquo;similar&amp;rdquo; even mean?&lt;/em> Every clustering algorithm is, at heart, a different answer to that question &amp;ndash; a different geometric, probabilistic, or graph-theoretic prior on what a &amp;ldquo;group&amp;rdquo; is.&lt;/p>
&lt;p>&lt;strong>What you will learn:&lt;/strong>&lt;/p></description></item><item><title>ML Math Derivations (17): Dimensionality Reduction and PCA</title><link>https://www.chenk.top/en/ml-math-derivations/17-dimensionality-reduction-and-pca/</link><pubDate>Thu, 05 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/17-dimensionality-reduction-and-pca/</guid><description>&lt;h2 id="what-this-article-covers">What This Article Covers&lt;/h2>
&lt;p>Feed a clustering algorithm $10{,}000$-dimensional data and it will most likely fail &amp;ndash; not because the algorithm is broken, but because &lt;strong>high-dimensional space is a hostile environment for distance-based learning&lt;/strong>. Volumes evaporate into thin shells, the ratio of nearest- to farthest-neighbour distances tends to $1$, and &amp;ldquo;closeness&amp;rdquo; stops carrying information. Dimensionality reduction is the response: project the data into a lower-dimensional space while keeping the structure that actually matters.&lt;/p></description></item><item><title>ML Math Derivations (16): Conditional Random Fields</title><link>https://www.chenk.top/en/ml-math-derivations/16-conditional-random-fields/</link><pubDate>Wed, 04 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/16-conditional-random-fields/</guid><description>&lt;h2 id="what-this-article-covers">What This Article Covers&lt;/h2>
&lt;p>Named entity recognition, POS tagging, information extraction &amp;ndash; every one of these tasks asks you to label each element of a sequence. HMMs (&lt;a href="https://www.chenk.top/en/Machine-Learning-Mathematical-Derivations-15-Hidden-Markov-Models/">Part 15&lt;/a>
) attack this problem &lt;strong>generatively&lt;/strong> by modelling the joint distribution $P(\mathbf{X},\mathbf{Y})$, but to make the joint factorise they pay a steep price: each observation is assumed independent of everything except its own hidden label. In real text, whether &lt;em>bank&lt;/em> is a noun or a verb depends on the preceding word, the following word, the suffix, capitalisation, dictionary lookups &amp;ndash; all of these features at once.&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (15): Hidden Markov Models</title><link>https://www.chenk.top/en/ml-math-derivations/15-hidden-markov-models/</link><pubDate>Tue, 03 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/15-hidden-markov-models/</guid><description>&lt;p>You hear footsteps behind you in a fog. You cannot see the walker, only the sounds. From the rhythm and pitch &amp;ndash; short, soft, hurried &amp;ndash; can you guess whether they are walking, running, or limping? And if you observed an entire sequence, which gait sequence is most likely? How likely is &lt;em>any&lt;/em> sequence of sounds under your model of how walking works?&lt;/p>
&lt;p>These are the &lt;strong>three problems of HMMs&lt;/strong>, and the surprise is that all three reduce to one trick: write the joint $P(\mathbf{O}, \mathbf{I})$ as a product of local factors along time, then &lt;strong>share sub-computations across time&lt;/strong> with dynamic programming. Brute force costs $O(N^T)$. Forward-Backward, Viterbi, and Baum-Welch all cost $O(N^2 T)$. The exponent collapses because the Markov assumption makes the future conditionally independent of the past given the present.&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (14): Variational Inference and Variational EM</title><link>https://www.chenk.top/en/ml-math-derivations/14-variational-inference-and-variational-em/</link><pubDate>Mon, 02 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/14-variational-inference-and-variational-em/</guid><description>&lt;p>When the posterior $p(\mathbf{z}\mid\mathbf{x})$ is intractable, you have two roads. &lt;strong>Sampling&lt;/strong> (MCMC) walks a Markov chain whose stationary distribution is the posterior — eventually exact, but slow and hard to diagnose. &lt;strong>Variational inference&lt;/strong> (VI) instead picks a simple family $\mathcal{Q}$ of distributions and finds the member $q^\star\in\mathcal{Q}$ that lies closest to the true posterior. 
Inference becomes optimization, and the same machinery that fits a neural network now fits a Bayesian model.&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (13): EM Algorithm and GMM</title><link>https://www.chenk.top/en/ml-math-derivations/13-em-algorithm-and-gmm/</link><pubDate>Sun, 01 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/13-em-algorithm-and-gmm/</guid><description>&lt;p>When data carries hidden structure &amp;ndash; a cluster label you never observed, a missing feature, a topic you cannot directly see &amp;ndash; maximum likelihood becomes painful. The log of a sum has no closed form, and gradient methods get tangled in the latent variables. The &lt;strong>EM algorithm&lt;/strong> sidesteps the difficulty with a deceptively simple idea: alternate between &lt;em>guessing&lt;/em> the hidden variables under a posterior (E-step) and &lt;em>fitting&lt;/em> the parameters as if those guesses were true (M-step). Each iteration is mathematically guaranteed to push the likelihood up. This post derives EM from first principles, proves the monotone-ascent property via Jensen&amp;rsquo;s inequality, and works through its most famous application: &lt;strong>Gaussian Mixture Models (GMM)&lt;/strong> &amp;ndash; the soft, elliptical generalisation of K-means.&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (12): XGBoost and LightGBM</title><link>https://www.chenk.top/en/ml-math-derivations/12-xgboost-and-lightgbm/</link><pubDate>Sat, 31 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/12-xgboost-and-lightgbm/</guid><description>&lt;p>XGBoost and LightGBM are the two libraries that quietly win most tabular-data battles &amp;mdash; on Kaggle leaderboards, in fraud-detection pipelines, in ad ranking, in churn models. They share the same backbone (gradient-boosted trees, Part 11) but make very different engineering bets:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>XGBoost&lt;/strong> sharpens the &lt;em>math&lt;/em>: it brings the second derivative of the loss into the objective, regularises the tree itself, and turns split selection into a closed-form score.&lt;/li>
&lt;li>&lt;strong>LightGBM&lt;/strong> sharpens the &lt;em>systems&lt;/em>: it bins features into small histograms, grows trees leaf-by-leaf, down-samples low-gradient examples (GOSS), and bundles mutually exclusive sparse features (EFB).&lt;/li>
&lt;/ul>
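&lt;p>&lt;em>A toy illustration (a sketch of the textbook formulas, not the library internals):&lt;/em> for a fixed leaf with gradient sum $G$, Hessian sum $H$ and L2 penalty $\lambda$, XGBoost&amp;rsquo;s second-order objective yields the closed-form optimal weight $w^\star = -G/(H+\lambda)$, and the split score is built from the same quantity.&lt;/p>

```python
import numpy as np

def leaf_weight(G, H, lam):
    # Closed-form minimiser of the second-order leaf objective
    #   G * w + 0.5 * (H + lam) * w**2,  giving  w* = -G / (H + lam)
    return -G / (H + lam)

# Squared-error sanity check: g_i = pred_i - y_i and h_i = 1, so with
# zero predictions the optimal leaf weight is the mean of the targets.
y = np.array([1.0, 2.0, 3.0])
pred = np.zeros_like(y)
G, H = float(np.sum(pred - y)), float(len(y))
print(leaf_weight(G, H, lam=0.0))  # prints 2.0, the mean of y
```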
&lt;p>The result is two tools that look interchangeable at the API level but behave very differently when $N$ or $d$ becomes large. This post derives every formula behind those choices so you can read a tuning guide and know &lt;em>why&lt;/em> each knob exists.&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (11): Ensemble Learning</title><link>https://www.chenk.top/en/ml-math-derivations/11-ensemble-learning/</link><pubDate>Fri, 30 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/11-ensemble-learning/</guid><description>&lt;p>Why does a committee of mediocre classifiers outperform a single brilliant one? The answer is unromantic but precise: averaging cuts variance, sequential reweighting cuts bias, and a little randomisation breaks the correlation that would otherwise destroy both effects. This post derives the mathematics behind that picture &amp;mdash; bias&amp;ndash;variance decomposition, bootstrap aggregating, AdaBoost as forward stagewise minimisation of exponential loss, and gradient boosting as gradient descent in function space.&lt;/p>
&lt;p>By the end you should be able to look at any ensemble method and say &lt;em>what it is reducing, why it works, and when it will fail.&lt;/em>&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (10): Semi-Naive Bayes and Bayesian Networks</title><link>https://www.chenk.top/en/ml-math-derivations/10-semi-naive-bayes-and-bayesian-networks/</link><pubDate>Thu, 29 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/10-semi-naive-bayes-and-bayesian-networks/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook.&lt;/strong> Naive Bayes assumes every feature is conditionally independent given the class. It is a convenient lie &amp;ndash; it lets us train in a single pass over the data, yet classifiers that model tree-structured or small-graph dependencies systematically beat it by a few accuracy points on virtually every UCI benchmark. This part walks the spectrum from &amp;ldquo;no dependencies&amp;rdquo; (Naive Bayes) to &amp;ldquo;all dependencies&amp;rdquo; (full joint), showing the three sweet spots that practitioners actually use: SPODE, TAN and AODE. The same factorisation idea, taken to its general form, is the Bayesian network.&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (9): Naive Bayes</title><link>https://www.chenk.top/en/ml-math-derivations/09-naive-bayes/</link><pubDate>Wed, 28 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/09-naive-bayes/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook:&lt;/strong> A spam filter that trains in milliseconds, scales to a million features, has &lt;em>no hyperparameters worth tuning&lt;/em>, and still beats much fancier models on short-text problems. Naive Bayes pulls this off by making one outrageous assumption — every feature is independent given the class — and refusing to apologise for it. The assumption is wrong on essentially every real dataset, yet the classifier works. Understanding &lt;em>why&lt;/em> is a tour through generative modelling, MAP estimation, Dirichlet priors, and the bias–variance tradeoff. This article walks the entire path.&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (8): Support Vector Machines</title><link>https://www.chenk.top/en/ml-math-derivations/08-support-vector-machines/</link><pubDate>Tue, 27 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/08-support-vector-machines/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook.&lt;/strong> You have two clouds of points and infinitely many lines that separate them. Which line is &amp;ldquo;best&amp;rdquo;? SVM gives a startlingly geometric answer: the line that sits in the middle of the &lt;em>widest empty corridor&lt;/em> between the two classes. Push that single idea through Lagrangian duality and it produces a sparse model (only the points on the corridor wall matter), a quadratic program with a global optimum, and &amp;ndash; almost as a free gift &amp;ndash; the kernel trick that lets the same linear machinery carve curved boundaries in infinite-dimensional spaces.&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (7): Decision Trees</title><link>https://www.chenk.top/en/ml-math-derivations/07-decision-trees/</link><pubDate>Mon, 26 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/07-decision-trees/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook.&lt;/strong> A decision tree mimics how humans actually decide things: ask a question, branch on the answer, ask the next question. The math under that intuition is surprisingly rich — entropy from information theory tells us &lt;em>which&lt;/em> question to ask first, the Gini index gives a cheaper proxy that lands on essentially the same trees, and cost-complexity pruning gives a principled way to stop the tree from memorising noise. Almost every modern boosted ensemble (XGBoost, LightGBM, CatBoost) is just a clever sum of these objects, so getting the foundations right pays off many times over.&lt;/p></description></item><item><title>Machine Learning Mathematical Derivations (6): Logistic Regression and Classification</title><link>https://www.chenk.top/en/ml-math-derivations/06-logistic-regression-and-classification/</link><pubDate>Sun, 25 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/06-logistic-regression-and-classification/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook.&lt;/strong> Linear regression maps inputs to any real number — but what if the output has to be a probability between 0 and 1? Logistic regression solves this with one elegant trick: a sigmoid squashing function. Despite its name, logistic regression is a &lt;em>classification&lt;/em> algorithm, and its math underpins every neuron in every modern neural network.&lt;/p>
&lt;/blockquote>
&lt;h2 id="what-you-will-learn">What You Will Learn&lt;/h2>
&lt;ul>
&lt;li>Why sigmoid is the natural way to turn a real-valued score into a probability, and why its derivative is so clean.&lt;/li>
&lt;li>How cross-entropy loss falls out of maximum likelihood estimation in two lines.&lt;/li>
&lt;li>Why cross-entropy beats MSE for classification — a vanishing-gradient argument made visible.&lt;/li>
&lt;li>The full gradient and Hessian for both binary and multi-class (softmax) cases, and why the loss is convex.&lt;/li>
&lt;li>L1, L2 and elastic-net regularization, and the Bayesian priors hiding behind them.&lt;/li>
&lt;li>Decision-boundary geometry and the threshold-free metrics (ROC / PR / AUC) that you actually need under class imbalance.&lt;/li>
&lt;/ul>
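&lt;p>&lt;em>A quick numerical check (an illustrative sketch, not code from the post):&lt;/em> the &amp;ldquo;clean&amp;rdquo; derivative in the first bullet is the identity $\sigma'(z)=\sigma(z)\bigl(1-\sigma(z)\bigr)$, which a few lines of NumPy can confirm against a finite difference.&lt;/p>

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real-valued score into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4.0, 4.0, 9)
s = sigmoid(z)

# Analytic derivative via sigma'(z) = sigma(z) * (1 - sigma(z)),
# checked against a central finite difference.
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)
analytic = s * (1.0 - s)
print(np.allclose(numeric, analytic, atol=1e-8))  # True
```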
&lt;h2 id="prerequisites">Prerequisites&lt;/h2>
&lt;ul>
&lt;li>Calculus: chain rule, partial derivatives.&lt;/li>
&lt;li>Linear algebra: matrix multiplication, transpose.&lt;/li>
&lt;li>Probability: Bernoulli and categorical distributions, likelihood.&lt;/li>
&lt;li>Familiarity with &lt;a href="https://www.chenk.top/en/Machine-Learning-Mathematical-Derivations-5-Linear-Regression/">Part 5: Linear Regression&lt;/a>
.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="1-from-linear-models-to-probabilistic-classification">1. From Linear Models to Probabilistic Classification&lt;/h2>
&lt;h3 id="11-the-problem-with-raw-linear-output">1.1 The Problem with Raw Linear Output&lt;/h3>
&lt;p>Linear regression gives us $\hat y = \mathbf{w}^\top \mathbf{x}$, which is unbounded. For classification, two things go wrong:&lt;/p></description></item><item><title>Mathematical Derivation of Machine Learning (5): Linear Regression</title><link>https://www.chenk.top/en/ml-math-derivations/05-linear-regression/</link><pubDate>Sat, 24 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/05-linear-regression/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook.&lt;/strong> In 1886 Francis Galton noticed something strange about heredity: children of unusually tall (or short) parents tended to be closer to the average than their parents were. He called this drift toward the mean &lt;em>regression&lt;/em>, and the name stuck. The statistical curiosity grew up into the most consequential model in machine learning &amp;ndash; not because linear regression is powerful on its own, but because almost every other algorithm (logistic regression, neural networks, kernel methods) is some twist on the same idea: &lt;strong>fit a line, but in the right space.&lt;/strong>&lt;/p></description></item><item><title>ML Math Derivations (4): Convex Optimization Theory</title><link>https://www.chenk.top/en/ml-math-derivations/04-convex-optimization-theory/</link><pubDate>Fri, 23 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/04-convex-optimization-theory/</guid><description>&lt;h2 id="what-this-article-covers">What This Article Covers&lt;/h2>
&lt;p>In 1947, George Dantzig proposed the simplex method for linear programming, and a working theory of optimization was born. Eight decades later, optimization has become the engine of machine learning: every model you train, from a one-line linear regression to a 70B-parameter language model, is the answer to &lt;em>some&lt;/em> optimization problem.&lt;/p>
&lt;p>Among all such problems, &lt;strong>convex optimization holds a privileged place&lt;/strong>. The defining property is so strong it almost feels like cheating: every local minimum is automatically a global minimum, and a handful of well-understood algorithms come with airtight convergence guarantees. The whole reason we treat &amp;ldquo;convex&amp;rdquo; as a green flag and &amp;ldquo;non-convex&amp;rdquo; as a yellow one comes down to this single fact.&lt;/p></description></item><item><title>ML Math Derivations (3): Probability Theory and Statistical Inference</title><link>https://www.chenk.top/en/ml-math-derivations/03-probability-theory-and-statistical-inference/</link><pubDate>Thu, 22 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/03-probability-theory-and-statistical-inference/</guid><description>&lt;h2 id="what-this-article-covers">What This Article Covers&lt;/h2>
&lt;p>In 1912, Ronald Fisher introduced &lt;strong>maximum likelihood estimation&lt;/strong> in a short paper that quietly redefined statistics. His insight was almost embarrassingly simple: &lt;em>if a parameter setting makes the observed data extremely likely, that parameter setting is probably right&lt;/em>. Almost every modern learning algorithm — from logistic regression to large language models — is a descendant of this idea.&lt;/p>
&lt;p>But likelihood alone is not enough. To use it we need a vocabulary for uncertainty (probability spaces, distributions), guarantees that empirical quantities track population ones (laws of large numbers, central limit theorem), and tools for incorporating prior knowledge (Bayesian inference). This article assembles those pieces into a coherent foundation for everything that follows.&lt;/p></description></item><item><title>ML Math Derivations (2): Linear Algebra and Matrix Theory</title><link>https://www.chenk.top/en/ml-math-derivations/02-linear-algebra-and-matrix-theory/</link><pubDate>Wed, 21 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/02-linear-algebra-and-matrix-theory/</guid><description>&lt;h2 id="why-this-chapter-and-whats-different">Why this chapter, and what&amp;rsquo;s different&lt;/h2>
&lt;p>If you have already worked through a standard linear-algebra course you have seen most of these objects. &lt;strong>This chapter is not that course.&lt;/strong> It is the &lt;em>ML practitioner&amp;rsquo;s slice&lt;/em> of linear algebra: the half-dozen ideas that actually appear when you implement gradient descent, run PCA, train a neural net, or read a paper.&lt;/p>
&lt;p>Concretely the goals are:&lt;/p>
&lt;ol>
&lt;li>Build a &lt;strong>geometric intuition&lt;/strong> for what matrices &lt;em>do&lt;/em> (rotate, stretch, project, kill).&lt;/li>
&lt;li>Learn the four decompositions that show up everywhere &amp;ndash; spectral, &lt;strong>SVD&lt;/strong>, QR, Cholesky &amp;ndash; and &lt;em>which one to reach for&lt;/em>.&lt;/li>
&lt;li>Master enough &lt;strong>matrix calculus&lt;/strong> to derive any neural-net gradient on the back of an envelope.&lt;/li>
&lt;/ol>
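&lt;p>&lt;em>A taste of that &amp;ldquo;line of NumPy&amp;rdquo; style (an illustrative sketch, not code from the chapter):&lt;/em> the SVD in goal 2 is a single library call, and the factors multiply straight back to the original matrix.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

# Thin SVD: A = U @ diag(s) @ Vt, singular values sorted descending
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The decomposition is exact: multiplying the factors reconstructs A
A_hat = U @ np.diag(s) @ Vt
print(np.allclose(A, A_hat))  # True
```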
&lt;p>We skim the algebra of row reduction, determinants by cofactor, and abstract vector-space proofs. If you need those, the references at the bottom give the standard treatments. Here, every concept comes back to a picture or a line of NumPy.&lt;/p></description></item><item><title>ML Math Derivations (1): Introduction and Mathematical Foundations</title><link>https://www.chenk.top/en/ml-math-derivations/01-introduction-and-mathematical-foundations/</link><pubDate>Tue, 20 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/01-introduction-and-mathematical-foundations/</guid><description>&lt;h2 id="what-this-chapter-does">What this chapter does&lt;/h2>
&lt;p>In 2005 Google Research showed, on a public benchmark, that a statistical translation model trained on raw bilingual text could outperform decades of carefully engineered linguistic rules. The conclusion was uncomfortable for the experts of the day, but mathematically liberating: &lt;strong>a system that has never been told the rules of a language can still recover them, given enough examples.&lt;/strong> Why?&lt;/p>
&lt;p>The answer is not a trick of engineering &amp;ndash; it is a theorem. In this chapter we build, from first principles, the part of mathematics that explains &lt;em>when&lt;/em> learning from data is possible, &lt;em>how much data&lt;/em> is required, and &lt;em>what fundamentally limits&lt;/em> what any algorithm can do.&lt;/p></description></item></channel></rss>