<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Machine Learning on Chen Kai Blog</title><link>https://www.chenk.top/en/categories/machine-learning/</link><description>Recent content in Machine Learning on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Sun, 08 Feb 2026 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/categories/machine-learning/index.xml" rel="self" type="application/rss+xml"/><item><title>ML Math Derivations (20): Regularization and Model Selection</title><link>https://www.chenk.top/en/ml-math-derivations/20-regularization-and-model-selection/</link><pubDate>Sun, 08 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/20-regularization-and-model-selection/</guid><description>&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/ml-math-derivations/20-Regularization-and-Model-Selection/illustration_1.png" alt="ML Math Derivations (20): Regularization and Model Selection — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;h2 id="what-you-will-learn" class="heading-anchor">What You Will Learn&lt;a href="#what-you-will-learn" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;p>A 100-million-parameter network trained on 50,000 images &lt;em>should&lt;/em> overfit catastrophically. Modern deep networks generalise anyway. &lt;strong>Why?&lt;/strong> Two ingredients: &lt;em>regularisation&lt;/em> (techniques that constrain capacity) and &lt;em>generalisation theory&lt;/em> (mathematics that says when learning works at all). This article is the closing chapter of the series, and we use it to gather every tool we have built — least squares, MAP estimation, optimisation, EM, neural networks — and turn them on the deepest open question in the field: &lt;em>why does learning generalise?&lt;/em>&lt;/p></description></item><item><title>ML Math Derivations (19): Neural Networks and Backpropagation</title><link>https://www.chenk.top/en/ml-math-derivations/19-neural-networks-and-backpropagation/</link><pubDate>Sat, 07 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/19-neural-networks-and-backpropagation/</guid><description>&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/ml-math-derivations/19-Neural-Networks-and-Backpropagation/illustration_1.png" alt="ML Math Derivations (19): Neural Networks and Backpropagation — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;blockquote>
&lt;p>&lt;strong>Hook.&lt;/strong> In 1969 Minsky and Papert proved that a single perceptron could not learn XOR, and connectionist research went into a fifteen-year freeze. The thaw came when Rumelhart, Hinton and Williams realised that &lt;em>stacking&lt;/em> perceptrons makes the problem disappear — and that the same chain rule everyone learns in calculus, applied carefully, computes every gradient in a multilayer network for the cost of a single extra forward pass. That algorithm is backpropagation. Every gradient in every Transformer, every diffusion model, every GPT trained today still runs on it.&lt;/p></description></item><item><title>ML Math Derivations (18): Clustering Algorithms</title><link>https://www.chenk.top/en/ml-math-derivations/18-clustering-algorithms/</link><pubDate>Fri, 06 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/18-clustering-algorithms/</guid><description>&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/ml-math-derivations/18-Clustering-Algorithms/illustration_1.png" alt="ML Math Derivations (18): Clustering Algorithms — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;h2 id="what-you-will-learn" class="heading-anchor">What You Will Learn&lt;a href="#what-you-will-learn" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;p>A million customer records arrive with no labels. Can you discover meaningful groups automatically? That is &lt;strong>clustering&lt;/strong>, the most fundamental unsupervised learning task. Unlike classification, clustering forces you to first answer a slippery question: &lt;em>what does &amp;ldquo;similar&amp;rdquo; even mean?&lt;/em> Every clustering algorithm is, at heart, a different answer to that question — a different geometric, probabilistic, or graph-theoretic prior on what a &amp;ldquo;group&amp;rdquo; is.&lt;/p></description></item><item><title>ML Math Derivations (17): Dimensionality Reduction and PCA</title><link>https://www.chenk.top/en/ml-math-derivations/17-dimensionality-reduction-and-pca/</link><pubDate>Thu, 05 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/17-dimensionality-reduction-and-pca/</guid><description>&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/ml-math-derivations/17-Dimensionality-Reduction-and-PCA/illustration_1.png" alt="ML Math Derivations (17): Dimensionality Reduction and PCA — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;h2 id="what-you-will-learn" class="heading-anchor">What You Will Learn&lt;a href="#what-you-will-learn" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;p>Feed a clustering algorithm &lt;span class="math-inline">$10{,}000$&lt;/span>
-dimensional data and it will most likely fail — not because the algorithm is broken, but because &lt;strong>high-dimensional space is a hostile environment for distance-based learning&lt;/strong>. Volumes evaporate into thin shells, the ratio of nearest- to farthest-neighbour distances tends to &lt;span class="math-inline">$1$&lt;/span>
, and &amp;ldquo;closeness&amp;rdquo; stops carrying information. Dimensionality reduction is the response: project the data into a lower-dimensional space while keeping the structure that actually matters.&lt;/p></description></item><item><title>ML Math Derivations (16): Conditional Random Fields</title><link>https://www.chenk.top/en/ml-math-derivations/16-conditional-random-fields/</link><pubDate>Wed, 04 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/16-conditional-random-fields/</guid><description>&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/ml-math-derivations/16-Conditional-Random-Fields/illustration_1.png" alt="ML Math Derivations (16): Conditional Random Fields — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;h2 id="what-you-will-learn" class="heading-anchor">What You Will Learn&lt;a href="#what-you-will-learn" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;p>Named entity recognition, POS tagging, information extraction — every one of these tasks asks you to label each element of a sequence. HMMs (&lt;a href="https://www.chenk.top/en/ml-math-derivations/15-hidden-markov-models">Part 15&lt;/a>
) attack this problem &lt;strong>generatively&lt;/strong> by modelling the joint distribution &lt;span class="math-inline">$P(\mathbf{X},\mathbf{Y})$&lt;/span>
, but to make the joint factorise they pay a steep price: each observation is assumed independent of everything except its own hidden label. In real text, whether &lt;em>bank&lt;/em> is a noun or a verb depends on the preceding and following words, the suffix, capitalization, and dictionary lookups — all these features together.&lt;/p></description></item><item><title>ML Math Derivations (15): Hidden Markov Models</title><link>https://www.chenk.top/en/ml-math-derivations/15-hidden-markov-models/</link><pubDate>Tue, 03 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/15-hidden-markov-models/</guid><description>&lt;p>You hear footsteps behind you in the fog. You can&amp;rsquo;t see the walker, only the sounds. From the rhythm and pitch — short, soft, hurried — can you guess whether they are walking, running, or limping? And if you observed an entire sequence, which gait sequence is most likely? How likely is &lt;em>any&lt;/em> sequence of sounds under your model of how walking works?&lt;/p>
&lt;p>These are the &lt;strong>three problems of HMMs&lt;/strong>, and the surprise is that all three reduce to one trick: write the joint &lt;span class="math-inline">$P(\mathbf{O}, \mathbf{I})$&lt;/span>
 as a product of local factors along time, then &lt;strong>share sub-computations across time&lt;/strong> with dynamic programming. Brute force costs &lt;span class="math-inline">$O(N^T)$&lt;/span>
. Forward-Backward, Viterbi, and each iteration of Baum-Welch cost &lt;span class="math-inline">$O(N^2 T)$&lt;/span>
. The exponent collapses because the Markov assumption makes the future conditionally independent of the past given the present.&lt;/p></description></item><item><title>ML Math Derivations (14): Variational Inference and Variational EM</title><link>https://www.chenk.top/en/ml-math-derivations/14-variational-inference-and-variational-em/</link><pubDate>Mon, 02 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/14-variational-inference-and-variational-em/</guid><description>&lt;p>When the posterior &lt;span class="math-inline">$p(\mathbf{z}\mid\mathbf{x})$&lt;/span>
 is intractable, you have two roads. &lt;strong>Sampling&lt;/strong> (MCMC) walks a Markov chain whose stationary distribution is the posterior — eventually exact, but slow and hard to diagnose. &lt;strong>Variational inference&lt;/strong> (VI) instead picks a simple family &lt;span class="math-inline">$\mathcal{Q}$&lt;/span>
 of distributions and finds the member &lt;span class="math-inline">$q^\star\in\mathcal{Q}$&lt;/span>
 that lies closest to the true posterior. Inference becomes optimization, and the same machinery that fits a neural network now fits a Bayesian model.&lt;/p>
&lt;p>This post derives VI from a single identity, builds the mean-field algorithm and CAVI from that identity, connects EM and variational EM as special cases, and ends with the reparameterization trick that turns the ELBO into a stochastic objective compatible with autodiff — the engine inside every VAE.&lt;/p></description></item><item><title>ML Math Derivations (13): EM Algorithm and GMM</title><link>https://www.chenk.top/en/ml-math-derivations/13-em-algorithm-and-gmm/</link><pubDate>Sun, 01 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/13-em-algorithm-and-gmm/</guid><description>&lt;p>When data has hidden structure — like an unobserved cluster label, a missing feature, or an unseen topic — maximum likelihood becomes challenging. The log of a sum has no closed form, and gradient methods get entangled with the latent variables. The &lt;strong>EM algorithm&lt;/strong> sidesteps the difficulty with a deceptively simple idea: alternate between &lt;em>guessing&lt;/em> the hidden variables under a posterior (E-step) and &lt;em>fitting&lt;/em> the parameters as if those guesses were true (M-step). Each iteration is mathematically guaranteed to push the likelihood up. This post derives EM from first principles, proves the monotone-ascent property using Jensen&amp;rsquo;s inequality, and explores its most famous application: &lt;strong>Gaussian Mixture Models (GMM)&lt;/strong> — the soft, elliptical generalization of K-means.&lt;/p></description></item><item><title>ML Math Derivations (12): XGBoost and LightGBM</title><link>https://www.chenk.top/en/ml-math-derivations/12-xgboost-and-lightgbm/</link><pubDate>Sat, 31 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/12-xgboost-and-lightgbm/</guid><description>&lt;p>XGBoost and LightGBM are the two libraries that quietly win most tabular-data battles &amp;mdash; on Kaggle leaderboards, in fraud-detection pipelines, in ad ranking, in churn models. 
They share the same backbone (gradient-boosted trees, &lt;a href="https://www.chenk.top/en/ml-math-derivations/11-ensemble-learning/">Part 11&lt;/a>
) but make very different engineering bets:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>XGBoost&lt;/strong> sharpens the &lt;em>math&lt;/em>: it brings the second derivative of the loss into the objective, regularises the tree itself, and turns split selection into a closed-form score.&lt;/li>
&lt;li>&lt;strong>LightGBM&lt;/strong> sharpens the &lt;em>systems&lt;/em>: it bins feature values into histograms, grows trees leaf-by-leaf, down-samples small-gradient instances (GOSS) and bundles mutually exclusive sparse features (EFB).&lt;/li>
&lt;/ul>
&lt;p>The result is two tools that look interchangeable from the API but behave very differently when &lt;span class="math-inline">$N$&lt;/span>
 or &lt;span class="math-inline">$d$&lt;/span>
 becomes large. This post derives every formula behind those choices so you can read a tuning guide and know &lt;em>why&lt;/em> each knob exists.&lt;/p></description></item><item><title>ML Math Derivations (11): Ensemble Learning</title><link>https://www.chenk.top/en/ml-math-derivations/11-ensemble-learning/</link><pubDate>Fri, 30 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/11-ensemble-learning/</guid><description>&lt;p>Why do mediocre classifiers in a committee outperform a single brilliant one? The answer is straightforward: averaging reduces variance, sequential reweighting reduces bias, and a bit of randomization breaks the correlation that would otherwise negate these benefits. This post delves into the math behind this — bias-variance decomposition, bootstrap aggregating, AdaBoost as forward stagewise minimization of exponential loss, and gradient boosting as gradient descent in function space.&lt;/p>
&lt;p>By the end, you should be able to look at any ensemble method and say what it reduces, why it works, and when it fails.&lt;/p></description></item><item><title>ML Math Derivations (10): Semi-Naive Bayes and Bayesian Networks</title><link>https://www.chenk.top/en/ml-math-derivations/10-semi-naive-bayes-and-bayesian-networks/</link><pubDate>Thu, 29 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/10-semi-naive-bayes-and-bayesian-networks/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook.&lt;/strong> Naive Bayes assumes every feature is conditionally independent given the class. It is a convenient lie — one that lets us train in a single pass over the data, but one that classifiers based on tree structures and small graphs can systematically beat by a few accuracy points on virtually every UCI benchmark. This part walks the spectrum from &amp;ldquo;no dependencies&amp;rdquo; (Naive Bayes) to &amp;ldquo;all dependencies&amp;rdquo; (full joint), showing the three sweet spots that practitioners actually use: SPODE, TAN and AODE. The same factorisation idea, taken to its general form, is the Bayesian network.&lt;/p></description></item><item><title>ML Math Derivations (9): Naive Bayes</title><link>https://www.chenk.top/en/ml-math-derivations/09-naive-bayes/</link><pubDate>Wed, 28 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/09-naive-bayes/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook:&lt;/strong> A spam filter that trains in milliseconds, scales to a million features, has &lt;em>no hyperparameters worth tuning&lt;/em>, and still beats much fancier models on short-text problems. Naive Bayes pulls this off by making one outrageous assumption — every feature is independent given the class — and refusing to apologise for it. The assumption is wrong on essentially every real dataset, yet the classifier works. Understanding &lt;em>why&lt;/em> is a tour through generative modelling, MAP estimation, Dirichlet priors, and the bias–variance tradeoff. This article walks the entire path.&lt;/p></description></item><item><title>ML Math Derivations (8): Support Vector Machines</title><link>https://www.chenk.top/en/ml-math-derivations/08-support-vector-machines/</link><pubDate>Tue, 27 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/08-support-vector-machines/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook.&lt;/strong> You have two clouds of points and infinitely many lines that separate them. Which line is &amp;ldquo;best&amp;rdquo;? SVM gives a startlingly geometric answer: the line that sits in the middle of the &lt;em>widest empty corridor&lt;/em> between the two classes. Push that single idea through Lagrangian duality and it produces a sparse model (only the points on the corridor wall matter), a quadratic program with a global optimum, and — almost as a free gift — the kernel trick that lets the same linear machinery carve curved boundaries in infinite-dimensional spaces.&lt;/p></description></item><item><title>ML Math Derivations (7): Decision Trees</title><link>https://www.chenk.top/en/ml-math-derivations/07-decision-trees/</link><pubDate>Mon, 26 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/07-decision-trees/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook.&lt;/strong> A decision tree mimics how humans actually decide things: ask a question, branch on the answer, ask the next question. The math under that intuition is surprisingly rich — entropy from information theory tells us &lt;em>which&lt;/em> question to ask first, the Gini index gives a cheaper proxy that lands on essentially the same trees, and cost-complexity pruning gives a principled way to stop the tree from memorising noise. Almost every modern boosted ensemble (XGBoost, LightGBM, CatBoost) is just a clever sum of these objects, so getting the foundations right pays off many times over.&lt;/p></description></item><item><title>ML Math Derivations (6): Logistic Regression and Classification</title><link>https://www.chenk.top/en/ml-math-derivations/06-logistic-regression-and-classification/</link><pubDate>Sun, 25 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/06-logistic-regression-and-classification/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook.&lt;/strong> Linear regression maps inputs to any real number — but what if the output has to be a probability between 0 and 1? Logistic regression solves this with one elegant trick: a sigmoid squashing function. Despite its name, logistic regression is a &lt;em>classification&lt;/em> algorithm, and its math underpins every neuron in every modern neural network.&lt;/p>
&lt;/blockquote>
&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/ml-math-derivations/06-Logistic-Regression-and-Classification/illustration_1.png" alt="ML Math Derivations (6): Logistic Regression and Classification — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;h2 id="what-you-will-learn" class="heading-anchor">What You Will Learn&lt;a href="#what-you-will-learn" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;ul>
&lt;li>Why the sigmoid is the natural way to turn a real-valued score into a probability, and why its derivative is so clean.&lt;/li>
&lt;li>How cross-entropy loss falls out of maximum likelihood estimation in two lines.&lt;/li>
&lt;li>Why cross-entropy beats MSE for classification — a vanishing-gradient argument made visible.&lt;/li>
&lt;li>The full gradient and Hessian for both binary and multi-class (softmax) cases, and why the loss is convex.&lt;/li>
&lt;li>L1, L2 and elastic-net regularization, and the Bayesian priors hiding behind them.&lt;/li>
&lt;li>Decision-boundary geometry and the threshold-free metrics (ROC / PR / AUC) that you actually need under class imbalance.&lt;/li>
&lt;/ul>
&lt;h2 id="prerequisites" class="heading-anchor">Prerequisites&lt;a href="#prerequisites" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;ul>
&lt;li>Calculus: chain rule, partial derivatives.&lt;/li>
&lt;li>Linear algebra: matrix multiplication, transpose.&lt;/li>
&lt;li>Probability: Bernoulli and categorical distributions, likelihood.&lt;/li>
&lt;li>Familiarity with &lt;a href="https://www.chenk.top/en/ml-math-derivations/05-linear-regression">Part 5: Linear Regression&lt;/a>
.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="from-linear-models-to-probabilistic-classification" class="heading-anchor">From Linear Models to Probabilistic Classification&lt;a href="#from-linear-models-to-probabilistic-classification" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;h3 id="the-problem-with-raw-linear-output" class="heading-anchor">The Problem with Raw Linear Output&lt;a href="#the-problem-with-raw-linear-output" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h3>&lt;p>Linear regression gives us &lt;span class="math-inline">$\hat y = \mathbf{w}^\top \mathbf{x}$&lt;/span>
, which is unbounded. For classification, two things go wrong:&lt;/p></description></item><item><title>ML Math Derivations (5): Linear Regression</title><link>https://www.chenk.top/en/ml-math-derivations/05-linear-regression/</link><pubDate>Sat, 24 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/05-linear-regression/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook.&lt;/strong> In 1886 Francis Galton noticed something strange about heredity: children of unusually tall (or short) parents tended to be closer to the average than their parents were. He called this drift toward the mean &lt;em>regression&lt;/em>, and the name stuck. The statistical curiosity grew up into the most consequential model in machine learning — not because linear regression is powerful on its own, but because almost every other algorithm (logistic regression, neural networks, kernel methods) is some twist on the same idea: &lt;strong>fit a line, but in the right space.&lt;/strong>&lt;/p></description></item><item><title>ML Math Derivations (4): Convex Optimization Theory</title><link>https://www.chenk.top/en/ml-math-derivations/04-convex-optimization-theory/</link><pubDate>Fri, 23 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/04-convex-optimization-theory/</guid><description>&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/ml-math-derivations/04-Convex-Optimization-Theory/illustration_1.png" alt="ML Math Derivations (4): Convex Optimization Theory — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;h2 id="what-you-will-learn" class="heading-anchor">What You Will Learn&lt;a href="#what-you-will-learn" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;p>In 1947, George Dantzig proposed the simplex method for linear programming, and a working theory of optimization was born. Eight decades later, optimization has become the engine of machine learning: every model you train, from a one-line linear regression to a 70B-parameter language model, is the answer to &lt;em>some&lt;/em> optimization problem.&lt;/p></description></item><item><title>ML Math Derivations (3): Probability Theory and Statistical Inference</title><link>https://www.chenk.top/en/ml-math-derivations/03-probability-theory-and-statistical-inference/</link><pubDate>Thu, 22 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/03-probability-theory-and-statistical-inference/</guid><description>&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/ml-math-derivations/03-Probability-Theory-and-Statistical-Inference/illustration_1.png" alt="ML Math Derivations (3): Probability Theory and Statistical Inference — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;h2 id="what-you-will-learn" class="heading-anchor">What You Will Learn&lt;a href="#what-you-will-learn" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;p>In 1912, Ronald Fisher introduced &lt;strong>maximum likelihood estimation&lt;/strong> in a short paper that quietly redefined statistics. His insight was almost embarrassingly simple: &lt;em>if a parameter setting makes the observed data extremely likely, it is probably correct&lt;/em>. Almost every modern learning algorithm — from logistic regression to large language models — descends from this idea.&lt;/p></description></item><item><title>ML Math Derivations (2): Linear Algebra and Matrix Theory</title><link>https://www.chenk.top/en/ml-math-derivations/02-linear-algebra-and-matrix-theory/</link><pubDate>Wed, 21 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/02-linear-algebra-and-matrix-theory/</guid><description>&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/ml-math-derivations/02-Linear-Algebra-and-Matrix-Theory/illustration_1.png" alt="ML Math Derivations (2): Linear Algebra and Matrix Theory — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;h2 id="why-this-chapter-and-whats-different" class="heading-anchor">Why this chapter, and what&amp;rsquo;s different&lt;a href="#why-this-chapter-and-whats-different" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;p>If you have already worked through a standard linear-algebra course you have seen most of these objects. &lt;strong>This chapter is not that course.&lt;/strong> It is the &lt;em>ML practitioner&amp;rsquo;s slice&lt;/em> of linear algebra: the half-dozen ideas that actually appear when you implement gradient descent, run PCA, train a neural net, or read a paper.&lt;/p></description></item><item><title>ML Math Derivations (1): Introduction and Mathematical Foundations</title><link>https://www.chenk.top/en/ml-math-derivations/01-introduction-and-mathematical-foundations/</link><pubDate>Tue, 20 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/01-introduction-and-mathematical-foundations/</guid><description>&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/ml-math-derivations/01-Introduction-and-Mathematical-Foundations/illustration_1.png" alt="ML Math Derivations (1): Introduction and Mathematical Foundations — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;h2 id="what-this-chapter-does" class="heading-anchor">What this chapter does&lt;a href="#what-this-chapter-does" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;p>In 2005 Google Research showed, on a public benchmark, that a statistical translation model trained on raw bilingual text could outperform decades of carefully engineered linguistic rules. The conclusion was uncomfortable for the experts of the day, but mathematically liberating: &lt;strong>a system that has never been told the rules of a language can still recover them, given enough examples.&lt;/strong> Why?&lt;/p></description></item><item><title>Transfer Learning (10): Continual Learning</title><link>https://www.chenk.top/en/transfer-learning/10-continual-learning/</link><pubDate>Tue, 24 Jun 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/10-continual-learning/</guid><description>&lt;p>You can teach yourself to play guitar this year and you will still remember how to ride a bike. A neural network cannot. Fine-tune a vision model on CIFAR-10 then on SVHN, evaluate it on CIFAR-10 again, and accuracy collapses to barely above chance. The phenomenon is called &lt;strong>catastrophic forgetting&lt;/strong>, and overcoming it is the central problem of &lt;strong>continual learning (CL)&lt;/strong>: a learner that absorbs a stream of tasks &lt;span class="math-inline">$\mathcal{T}_1, \mathcal{T}_2, \ldots$&lt;/span>
 without re-accessing past data and without losing what it already knew.&lt;/p></description></item><item><title>Transfer Learning (9): Parameter-Efficient Fine-Tuning</title><link>https://www.chenk.top/en/transfer-learning/09-parameter-efficient-fine-tuning/</link><pubDate>Wed, 18 Jun 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/09-parameter-efficient-fine-tuning/</guid><description>&lt;p>How do you fine-tune a 175B-parameter model on a single GPU? Update only 0.1% of the parameters. Parameter-Efficient Fine-Tuning (PEFT) makes this possible — and on most benchmarks it matches full fine-tuning. This post derives the math behind LoRA, Adapter, Prefix-Tuning, Prompt-Tuning, BitFit and QLoRA, and gives you a single picture for choosing among them.&lt;/p>
&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/transfer-learning/09-parameter-efficient-fine-tuning/illustration_1.png" alt="Transfer Learning (9): Parameter-Efficient Fine-Tuning — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;h2 id="what-you-will-learn" class="heading-anchor">What You Will Learn&lt;a href="#what-you-will-learn" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;ul>
&lt;li>Why the low-rank assumption holds for weight updates&lt;/li>
&lt;li>LoRA: derivation, initialization, scaling, and weight merging&lt;/li>
&lt;li>Adapter: bottleneck architecture and where to insert it&lt;/li>
&lt;li>Prefix-Tuning vs Prompt-Tuning vs P-Tuning v2&lt;/li>
&lt;li>QLoRA: how 4-bit quantisation gets a 65B model on one GPU&lt;/li>
&lt;li>Method comparison and a selection guide grounded in GLUE numbers&lt;/li>
&lt;/ul>
&lt;h2 id="prerequisites" class="heading-anchor">Prerequisites&lt;a href="#prerequisites" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;ul>
&lt;li>Transformer architecture (attention, FFN, residual + LayerNorm)&lt;/li>
&lt;li>Matrix decomposition basics (rank, SVD)&lt;/li>
&lt;li>Transfer learning fundamentals (Parts 1-6)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="the-full-fine-tuning-problem" class="heading-anchor">The Full Fine-Tuning Problem&lt;a href="#the-full-fine-tuning-problem" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
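&lt;/h2>&lt;p>Full fine-tuning updates every parameter of the pretrained model to minimise the task loss:&lt;/p>&lt;span class="math-block">$$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})$$&lt;/span>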
&lt;p>
For GPT-3 (175B params) this means roughly &lt;strong>700 GB of FP32 weights&lt;/strong>, plus gradients, plus optimiser states — and one full copy per task. Even after the model fits, the per-task storage and serving cost is brutal: 100 customers means 100 copies of a 700 GB checkpoint.&lt;/p></description></item></channel></rss>