<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Machine Learning on Chen Kai Blog</title><link>https://www.chenk.top/en/tags/machine-learning/</link><description>Recent content in Machine Learning on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Fri, 08 May 2026 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/tags/machine-learning/index.xml" rel="self" type="application/rss+xml"/><item><title>Alibaba Cloud Full Stack (11): PAI — The ML Platform</title><link>https://www.chenk.top/en/aliyun-fullstack/11-pai-ml-platform/</link><pubDate>Fri, 08 May 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/aliyun-fullstack/11-pai-ml-platform/</guid><description>&lt;p>Training a model on a single GPU is fun. Deploying it to handle 1,000 requests per second without failing is what separates experiments from products. PAI handles both.&lt;/p>
&lt;p>PAI (Platform for AI) is Alibaba Cloud&amp;rsquo;s managed ML platform. It&amp;rsquo;s not just one product; it&amp;rsquo;s five products in a trench coat, sharing a console. These include a notebook environment for exploration, a distributed training service for scale, a model serving platform for production, a visual pipeline designer for those who prefer dragging boxes, and a model gallery for one-click deployment of open-source models. After eighteen months of running real LLM workloads on it, I can say that the individual components range from excellent (EAS) to good enough (Designer). The whole platform is genuinely greater than the sum of its parts once you understand how they connect.&lt;/p></description></item><item><title>Aliyun PAI (2): PAI-DSW — Notebooks That Don't Eat Your Weights</title><link>https://www.chenk.top/en/aliyun-pai/02-pai-dsw-notebook/</link><pubDate>Fri, 06 Mar 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/aliyun-pai/02-pai-dsw-notebook/</guid><description>&lt;p>Every time I onboard a new ML engineer to PAI, the first day looks the same. They start a DSW instance, &lt;code>pip install&lt;/code> their world, train for an hour, restart the kernel for some reason, and then ask me where their model file went. The honest answer — &amp;ldquo;in &lt;code>/root&lt;/code> on a node that no longer exists&amp;rdquo; — is the kind of lesson you only need to learn once. This article is the version of that lesson you read in advance.&lt;/p></description></item><item><title>Aliyun PAI (1): Platform Overview and the Product Family Map</title><link>https://www.chenk.top/en/aliyun-pai/01-platform-overview/</link><pubDate>Thu, 05 Mar 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/aliyun-pai/01-platform-overview/</guid><description>&lt;p>If your team trains or serves models on Alibaba Cloud, you&amp;rsquo;ll eventually use the PAI console. 
PAI is the umbrella; underneath it are the actual workhorses — a notebook product, a distributed training service, a model-serving service, and a few GUI/quick-deploy layers. After about eighteen months of running real LLM workloads on it for an AI marketing platform, this series is the field guide I wish I had before deploying my first endpoint.&lt;/p></description></item><item><title>ML Math Derivations (20): Regularization and Model Selection</title><link>https://www.chenk.top/en/ml-math-derivations/20-regularization-and-model-selection/</link><pubDate>Sun, 08 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/20-regularization-and-model-selection/</guid><description>&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/ml-math-derivations/20-Regularization-and-Model-Selection/illustration_1.png" alt="ML Math Derivations (20): Regularization and Model Selection — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;h2 id="what-you-will-learn" class="heading-anchor">What You Will Learn&lt;a href="#what-you-will-learn" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;p>A 100-million-parameter network trained on 50,000 images &lt;em>should&lt;/em> overfit catastrophically. Modern deep networks generalise anyway. &lt;strong>Why?&lt;/strong> Two ingredients: &lt;em>regularisation&lt;/em> (techniques that constrain capacity) and &lt;em>generalisation theory&lt;/em> (mathematics that says when learning works at all). This article is the closing chapter of the series, and we use it to gather every tool we have built — least squares, MAP estimation, optimisation, EM, neural networks — and turn them on the deepest open question in the field: &lt;em>why does learning generalise?&lt;/em>&lt;/p></description></item><item><title>ML Math Derivations (19): Neural Networks and Backpropagation</title><link>https://www.chenk.top/en/ml-math-derivations/19-neural-networks-and-backpropagation/</link><pubDate>Sat, 07 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/19-neural-networks-and-backpropagation/</guid><description>&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/ml-math-derivations/19-Neural-Networks-and-Backpropagation/illustration_1.png" alt="ML Math Derivations (19): Neural Networks and Backpropagation — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;blockquote>
&lt;p>&lt;strong>Hook.&lt;/strong> In 1969 Minsky and Papert proved that a single perceptron could not learn XOR, and connectionist research went into a fifteen-year freeze. The thaw came when Rumelhart, Hinton and Williams realised that &lt;em>stacking&lt;/em> perceptrons makes the problem disappear — and that the same chain rule everyone learns in calculus, applied carefully, computes every gradient in a multilayer network for roughly the cost of one extra pass through the network. That algorithm is backpropagation. Every gradient in every Transformer, every diffusion model, every GPT trained today still runs on it.&lt;/p></description></item><item><title>ML Math Derivations (18): Clustering Algorithms</title><link>https://www.chenk.top/en/ml-math-derivations/18-clustering-algorithms/</link><pubDate>Fri, 06 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/18-clustering-algorithms/</guid><description>&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/ml-math-derivations/18-Clustering-Algorithms/illustration_1.png" alt="ML Math Derivations (18): Clustering Algorithms — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;h2 id="what-you-will-learn" class="heading-anchor">What You Will Learn&lt;a href="#what-you-will-learn" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;p>A million customer records arrive with no labels. Can you discover meaningful groups automatically? That is &lt;strong>clustering&lt;/strong>, the most fundamental unsupervised learning task. Unlike classification, clustering forces you to first answer a slippery question: &lt;em>what does &amp;ldquo;similar&amp;rdquo; even mean?&lt;/em> Every clustering algorithm is, at heart, a different answer to that question — a different geometric, probabilistic, or graph-theoretic prior on what a &amp;ldquo;group&amp;rdquo; is.&lt;/p></description></item><item><title>ML Math Derivations (17): Dimensionality Reduction and PCA</title><link>https://www.chenk.top/en/ml-math-derivations/17-dimensionality-reduction-and-pca/</link><pubDate>Thu, 05 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/17-dimensionality-reduction-and-pca/</guid><description>&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/ml-math-derivations/17-Dimensionality-Reduction-and-PCA/illustration_1.png" alt="ML Math Derivations (17): Dimensionality Reduction and PCA — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;h2 id="what-you-will-learn" class="heading-anchor">What You Will Learn&lt;a href="#what-you-will-learn" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;p>Feed a clustering algorithm &lt;span class="math-inline">$10{,}000$&lt;/span>
-dimensional data and it will most likely fail — not because the algorithm is broken, but because &lt;strong>high-dimensional space is a hostile environment for distance-based learning&lt;/strong>. Volumes evaporate into thin shells, the ratio of nearest- to farthest-neighbour distances tends to &lt;span class="math-inline">$1$&lt;/span>
, and &amp;ldquo;closeness&amp;rdquo; stops carrying information. Dimensionality reduction is the response: project the data into a lower-dimensional space while keeping the structure that actually matters.&lt;/p></description></item><item><title>ML Math Derivations (16): Conditional Random Fields</title><link>https://www.chenk.top/en/ml-math-derivations/16-conditional-random-fields/</link><pubDate>Wed, 04 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/16-conditional-random-fields/</guid><description>&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/ml-math-derivations/16-Conditional-Random-Fields/illustration_1.png" alt="ML Math Derivations (16): Conditional Random Fields — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;h2 id="what-you-will-learn" class="heading-anchor">What You Will Learn&lt;a href="#what-you-will-learn" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;p>Named entity recognition, POS tagging, information extraction — every one of these tasks asks you to label each element of a sequence. HMMs (&lt;a href="https://www.chenk.top/en/ml-math-derivations/15-hidden-markov-models">Part 15&lt;/a>
) attack this problem &lt;strong>generatively&lt;/strong> by modelling the joint distribution &lt;span class="math-inline">$P(\mathbf{X},\mathbf{Y})$&lt;/span>
, but to make the joint factorise they pay a steep price: each observation is assumed independent of everything except its own hidden label. In real text, whether &lt;em>bank&lt;/em> is a noun or a verb depends on the preceding and following words, the suffix, capitalization, and dictionary lookups — all these features together.&lt;/p></description></item><item><title>ML Math Derivations (15): Hidden Markov Models</title><link>https://www.chenk.top/en/ml-math-derivations/15-hidden-markov-models/</link><pubDate>Tue, 03 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/15-hidden-markov-models/</guid><description>&lt;p>You hear footsteps behind you in the fog. You can&amp;rsquo;t see the walker, only the sounds. From the rhythm and pitch — short, soft, hurried — can you guess whether they are walking, running, or limping? And if you observed an entire sequence, which gait sequence is most likely? How likely is &lt;em>any&lt;/em> sequence of sounds under your model of how walking works?&lt;/p>
&lt;p>These are the &lt;strong>three problems of HMMs&lt;/strong>, and the surprise is that all three reduce to one trick: write the joint &lt;span class="math-inline">$P(\mathbf{O}, \mathbf{I})$&lt;/span>
 as a product of local factors along time, then &lt;strong>share sub-computations across time&lt;/strong> with dynamic programming. Brute force costs &lt;span class="math-inline">$O(N^T)$&lt;/span>
. Forward-Backward, Viterbi, and each iteration of Baum-Welch cost &lt;span class="math-inline">$O(N^2 T)$&lt;/span>
. The exponent collapses because the Markov assumption makes the future conditionally independent of the past given the present.&lt;/p></description></item><item><title>ML Math Derivations (14): Variational Inference and Variational EM</title><link>https://www.chenk.top/en/ml-math-derivations/14-variational-inference-and-variational-em/</link><pubDate>Mon, 02 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/14-variational-inference-and-variational-em/</guid><description>&lt;p>When the posterior &lt;span class="math-inline">$p(\mathbf{z}\mid\mathbf{x})$&lt;/span>
 is intractable, you have two roads. &lt;strong>Sampling&lt;/strong> (MCMC) walks a Markov chain whose stationary distribution is the posterior — eventually exact, but slow and hard to diagnose. &lt;strong>Variational inference&lt;/strong> (VI) instead picks a simple family &lt;span class="math-inline">$\mathcal{Q}$&lt;/span>
 of distributions and finds the member &lt;span class="math-inline">$q^\star\in\mathcal{Q}$&lt;/span>
 that lies closest to the true posterior. Inference becomes optimization, and the same machinery that fits a neural network now fits a Bayesian model.&lt;/p>
&lt;p>This post derives VI from a single identity, builds the mean-field algorithm and CAVI from that identity, recovers EM and variational EM as special cases, and ends with the reparameterization trick that turns the ELBO into a stochastic objective compatible with autodiff — the engine inside every VAE.&lt;/p></description></item><item><title>ML Math Derivations (13): EM Algorithm and GMM</title><link>https://www.chenk.top/en/ml-math-derivations/13-em-algorithm-and-gmm/</link><pubDate>Sun, 01 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/13-em-algorithm-and-gmm/</guid><description>&lt;p>When data has hidden structure — like an unobserved cluster label, a missing feature, or an unseen topic — maximum likelihood becomes challenging. The log-likelihood contains the log of a sum, so its maximizer has no closed form, and gradient methods get entangled with the latent variables. The &lt;strong>EM algorithm&lt;/strong> sidesteps the difficulty with a deceptively simple idea: alternate between &lt;em>guessing&lt;/em> the hidden variables under a posterior (E-step) and &lt;em>fitting&lt;/em> the parameters as if those guesses were true (M-step). Each iteration is mathematically guaranteed never to decrease the likelihood. This post derives EM from first principles, proves the monotone-ascent property using Jensen&amp;rsquo;s inequality, and explores its most famous application: &lt;strong>Gaussian Mixture Models (GMM)&lt;/strong> — the soft, elliptical generalization of K-means.&lt;/p></description></item><item><title>ML Math Derivations (12): XGBoost and LightGBM</title><link>https://www.chenk.top/en/ml-math-derivations/12-xgboost-and-lightgbm/</link><pubDate>Sat, 31 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/12-xgboost-and-lightgbm/</guid><description>&lt;p>XGBoost and LightGBM are the two libraries that quietly win most tabular-data battles &amp;mdash; on Kaggle leaderboards, in fraud-detection pipelines, in ad ranking, in churn models. 
They share the same backbone (gradient-boosted trees, &lt;a href="https://www.chenk.top/en/ml-math-derivations/11-ensemble-learning/">Part 11&lt;/a>
) but make very different engineering bets:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>XGBoost&lt;/strong> sharpens the &lt;em>math&lt;/em>: it brings the second derivative of the loss into the objective, regularises the tree itself, and turns split selection into a closed-form score.&lt;/li>
&lt;li>&lt;strong>LightGBM&lt;/strong> sharpens the &lt;em>systems&lt;/em>: it bins feature values into a small number of histogram buckets, grows trees leaf-by-leaf, down-samples small-gradient instances (GOSS), and bundles mutually exclusive sparse features (EFB).&lt;/li>
&lt;/ul>
&lt;p>The result is two tools that look interchangeable from the API but behave very differently when &lt;span class="math-inline">$N$&lt;/span>
 or &lt;span class="math-inline">$d$&lt;/span>
 becomes large. This post derives every formula behind those choices so you can read a tuning guide and know &lt;em>why&lt;/em> each knob exists.&lt;/p></description></item><item><title>ML Math Derivations (11): Ensemble Learning</title><link>https://www.chenk.top/en/ml-math-derivations/11-ensemble-learning/</link><pubDate>Fri, 30 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/11-ensemble-learning/</guid><description>&lt;p>Why do mediocre classifiers in a committee outperform a single brilliant one? The answer is straightforward: averaging reduces variance, sequential reweighting reduces bias, and a bit of randomization breaks the correlation that would otherwise negate these benefits. This post delves into the math behind this — bias-variance decomposition, bootstrap aggregating, AdaBoost as forward stagewise minimization of exponential loss, and gradient boosting as gradient descent in function space.&lt;/p>
&lt;p>By the end, you should be able to look at any ensemble method and say what it reduces, why it works, and when it fails.&lt;/p></description></item><item><title>ML Math Derivations (10): Semi-Naive Bayes and Bayesian Networks</title><link>https://www.chenk.top/en/ml-math-derivations/10-semi-naive-bayes-and-bayesian-networks/</link><pubDate>Thu, 29 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/10-semi-naive-bayes-and-bayesian-networks/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook.&lt;/strong> Naive Bayes assumes every feature is conditionally independent given the class. It is a convenient lie — one that lets us train in a single pass over the data, but one that classifiers based on tree structures and small graphs can systematically beat by a few accuracy points on virtually every UCI benchmark. This part walks the spectrum from &amp;ldquo;no dependencies&amp;rdquo; (Naive Bayes) to &amp;ldquo;all dependencies&amp;rdquo; (full joint), showing the three sweet spots that practitioners actually use: SPODE, TAN and AODE. The same factorisation idea, taken to its general form, is the Bayesian network.&lt;/p></description></item><item><title>ML Math Derivations (9): Naive Bayes</title><link>https://www.chenk.top/en/ml-math-derivations/09-naive-bayes/</link><pubDate>Wed, 28 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/09-naive-bayes/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook:&lt;/strong> A spam filter that trains in milliseconds, scales to a million features, has &lt;em>no hyperparameters worth tuning&lt;/em>, and still beats much fancier models on short-text problems. Naive Bayes pulls this off by making one outrageous assumption — every feature is independent given the class — and refusing to apologise for it. The assumption is wrong on essentially every real dataset, yet the classifier works. Understanding &lt;em>why&lt;/em> is a tour through generative modelling, MAP estimation, Dirichlet priors, and the bias–variance tradeoff. This article walks the entire path.&lt;/p></description></item><item><title>ML Math Derivations (8): Support Vector Machines</title><link>https://www.chenk.top/en/ml-math-derivations/08-support-vector-machines/</link><pubDate>Tue, 27 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/08-support-vector-machines/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook.&lt;/strong> You have two clouds of points and infinitely many lines that separate them. Which line is &amp;ldquo;best&amp;rdquo;? SVM gives a startlingly geometric answer: the line that sits in the middle of the &lt;em>widest empty corridor&lt;/em> between the two classes. Push that single idea through Lagrangian duality and it produces a sparse model (only the points on the corridor wall matter), a quadratic program with a global optimum, and — almost as a free gift — the kernel trick that lets the same linear machinery carve curved boundaries in infinite-dimensional spaces.&lt;/p></description></item><item><title>ML Math Derivations (7): Decision Trees</title><link>https://www.chenk.top/en/ml-math-derivations/07-decision-trees/</link><pubDate>Mon, 26 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/07-decision-trees/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook.&lt;/strong> A decision tree mimics how humans actually decide things: ask a question, branch on the answer, ask the next question. The math under that intuition is surprisingly rich — entropy from information theory tells us &lt;em>which&lt;/em> question to ask first, the Gini index gives a cheaper proxy that lands on essentially the same trees, and cost-complexity pruning gives a principled way to stop the tree from memorising noise. Almost every modern boosted ensemble (XGBoost, LightGBM, CatBoost) is just a clever sum of these objects, so getting the foundations right pays off many times over.&lt;/p></description></item><item><title>ML Math Derivations (6): Logistic Regression and Classification</title><link>https://www.chenk.top/en/ml-math-derivations/06-logistic-regression-and-classification/</link><pubDate>Sun, 25 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/06-logistic-regression-and-classification/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook.&lt;/strong> Linear regression maps inputs to any real number — but what if the output has to be a probability between 0 and 1? Logistic regression solves this with one elegant trick: a sigmoid squashing function. Despite its name, logistic regression is a &lt;em>classification&lt;/em> algorithm, and its math underpins every neuron in every modern neural network.&lt;/p>
&lt;/blockquote>
&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/ml-math-derivations/06-Logistic-Regression-and-Classification/illustration_1.png" alt="ML Math Derivations (6): Logistic Regression and Classification — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;h2 id="what-you-will-learn" class="heading-anchor">What You Will Learn&lt;a href="#what-you-will-learn" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;ul>
&lt;li>Why sigmoid is the natural way to turn a real-valued score into a probability, and why its derivative is so clean.&lt;/li>
&lt;li>How cross-entropy loss falls out of maximum likelihood estimation in two lines.&lt;/li>
&lt;li>Why cross-entropy beats MSE for classification — a vanishing-gradient argument made visible.&lt;/li>
&lt;li>The full gradient and Hessian for both binary and multi-class (softmax) cases, and why the loss is convex.&lt;/li>
&lt;li>L1, L2 and elastic-net regularization, and the Bayesian priors hiding behind them.&lt;/li>
&lt;li>Decision-boundary geometry and the threshold-free metrics (ROC / PR / AUC) that you actually need under class imbalance.&lt;/li>
&lt;/ul>
&lt;h2 id="prerequisites" class="heading-anchor">Prerequisites&lt;a href="#prerequisites" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;ul>
&lt;li>Calculus: chain rule, partial derivatives.&lt;/li>
&lt;li>Linear algebra: matrix multiplication, transpose.&lt;/li>
&lt;li>Probability: Bernoulli and categorical distributions, likelihood.&lt;/li>
&lt;li>Familiarity with &lt;a href="https://www.chenk.top/en/ml-math-derivations/05-linear-regression">Part 5: Linear Regression&lt;/a>
.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="from-linear-models-to-probabilistic-classification" class="heading-anchor">From Linear Models to Probabilistic Classification&lt;a href="#from-linear-models-to-probabilistic-classification" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;h3 id="the-problem-with-raw-linear-output" class="heading-anchor">The Problem with Raw Linear Output&lt;a href="#the-problem-with-raw-linear-output" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h3>&lt;p>Linear regression gives us &lt;span class="math-inline">$\hat y = \mathbf{w}^\top \mathbf{x}$&lt;/span>
, which is unbounded. For classification, two things go wrong:&lt;/p></description></item><item><title>ML Math Derivations (5): Linear Regression</title><link>https://www.chenk.top/en/ml-math-derivations/05-linear-regression/</link><pubDate>Sat, 24 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/05-linear-regression/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>Hook.&lt;/strong> In 1886 Francis Galton noticed something strange about heredity: children of unusually tall (or short) parents tended to be closer to the average than their parents were. He called this drift toward the mean &lt;em>regression&lt;/em>, and the name stuck. The statistical curiosity grew up into the most consequential model in machine learning — not because linear regression is powerful on its own, but because almost every other algorithm (logistic regression, neural networks, kernel methods) is some twist on the same idea: &lt;strong>fit a line, but in the right space.&lt;/strong>&lt;/p></description></item><item><title>ML Math Derivations (4): Convex Optimization Theory</title><link>https://www.chenk.top/en/ml-math-derivations/04-convex-optimization-theory/</link><pubDate>Fri, 23 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/04-convex-optimization-theory/</guid><description>&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/ml-math-derivations/04-Convex-Optimization-Theory/illustration_1.png" alt="ML Math Derivations (4): Convex Optimization Theory — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;h2 id="what-you-will-learn" class="heading-anchor">What You Will Learn&lt;a href="#what-you-will-learn" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;p>In 1947, George Dantzig proposed the simplex method for linear programming, and a working theory of optimization was born. Eight decades later, optimization has become the engine of machine learning: every model you train, from a one-line linear regression to a 70B-parameter language model, is the answer to &lt;em>some&lt;/em> optimization problem.&lt;/p></description></item><item><title>ML Math Derivations (3): Probability Theory and Statistical Inference</title><link>https://www.chenk.top/en/ml-math-derivations/03-probability-theory-and-statistical-inference/</link><pubDate>Thu, 22 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/03-probability-theory-and-statistical-inference/</guid><description>&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/ml-math-derivations/03-Probability-Theory-and-Statistical-Inference/illustration_1.png" alt="ML Math Derivations (3): Probability Theory and Statistical Inference — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;h2 id="what-you-will-learn" class="heading-anchor">What You Will Learn&lt;a href="#what-you-will-learn" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;p>In 1912, Ronald Fisher introduced &lt;strong>maximum likelihood estimation&lt;/strong> in a short paper that quietly redefined statistics. His insight was almost embarrassingly simple: &lt;em>if a parameter setting makes the observed data extremely likely, it is probably correct&lt;/em>. Almost every modern learning algorithm — from logistic regression to large language models — descends from this idea.&lt;/p></description></item><item><title>ML Math Derivations (2): Linear Algebra and Matrix Theory</title><link>https://www.chenk.top/en/ml-math-derivations/02-linear-algebra-and-matrix-theory/</link><pubDate>Wed, 21 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/02-linear-algebra-and-matrix-theory/</guid><description>&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/ml-math-derivations/02-Linear-Algebra-and-Matrix-Theory/illustration_1.png" alt="ML Math Derivations (2): Linear Algebra and Matrix Theory — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;h2 id="why-this-chapter-and-whats-different" class="heading-anchor">Why this chapter, and what&amp;rsquo;s different&lt;a href="#why-this-chapter-and-whats-different" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;p>If you have already worked through a standard linear-algebra course you have seen most of these objects. &lt;strong>This chapter is not that course.&lt;/strong> It is the &lt;em>ML practitioner&amp;rsquo;s slice&lt;/em> of linear algebra: the half-dozen ideas that actually appear when you implement gradient descent, run PCA, train a neural net, or read a paper.&lt;/p></description></item><item><title>ML Math Derivations (1): Introduction and Mathematical Foundations</title><link>https://www.chenk.top/en/ml-math-derivations/01-introduction-and-mathematical-foundations/</link><pubDate>Tue, 20 Jan 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/ml-math-derivations/01-introduction-and-mathematical-foundations/</guid><description>&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/ml-math-derivations/01-Introduction-and-Mathematical-Foundations/illustration_1.png" alt="ML Math Derivations (1): Introduction and Mathematical Foundations — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;h2 id="what-this-chapter-does" class="heading-anchor">What this chapter does&lt;a href="#what-this-chapter-does" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;p>In 2005 Google Research showed, on a public benchmark, that a statistical translation model trained on raw bilingual text could outperform decades of carefully engineered linguistic rules. The conclusion was uncomfortable for the experts of the day, but mathematically liberating: &lt;strong>a system that has never been told the rules of a language can still recover them, given enough examples.&lt;/strong> Why?&lt;/p></description></item><item><title>Symplectic Geometry and Structure-Preserving Neural Networks</title><link>https://www.chenk.top/en/standalone/symplectic-geometry-and-structure-preserving-neural-networks/</link><pubDate>Mon, 28 Jul 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/symplectic-geometry-and-structure-preserving-neural-networks/</guid><description>&lt;p>Train a vanilla feedforward network to predict a one-dimensional harmonic oscillator. Validate it on the next ten time steps — the error is fine. Now roll it out for a thousand steps. The orbit no longer closes, the energy creeps upward, and what should be periodic motion turns into a slow spiral. The network learned to fit data points but never learned the &lt;em>physics&lt;/em>. Structure-preserving networks fix this by incorporating geometric invariants — energy conservation, the symplectic 2-form, and the Euler-Lagrange equations — directly into the architecture, ensuring the learned model cannot violate them no matter how long you integrate.&lt;/p></description></item><item><title>Transfer Learning (1): Fundamentals and Core Concepts</title><link>https://www.chenk.top/en/transfer-learning/01-fundamentals-and-core-concepts/</link><pubDate>Thu, 01 May 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/01-fundamentals-and-core-concepts/</guid><description>&lt;p>You spent two weeks training an ImageNet classifier on a rack of GPUs. 
On Monday morning, your team lead asks for a chest X-ray pneumonia model, and the entire labeled dataset is &lt;strong>two hundred images&lt;/strong>. Do you book another two weeks of GPU time and start from scratch?&lt;/p>
&lt;p>Of course not. You use what the ImageNet model already knows about edges, textures, and shapes, swap out the last layer, and fine-tune on the X-rays. Two hours later, you have a model that beats anything you could have trained from random weights with so little data. That&amp;rsquo;s &lt;strong>transfer learning&lt;/strong>, and it&amp;rsquo;s why most real-world deep learning projects ship in days instead of months.&lt;/p></description></item><item><title>Essence of Linear Algebra (15): Linear Algebra in Machine Learning</title><link>https://www.chenk.top/en/linear-algebra/15-linear-algebra-in-machine-learning/</link><pubDate>Wed, 09 Apr 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/linear-algebra/15-linear-algebra-in-machine-learning/</guid><description>&lt;p>Ask any senior ML engineer &amp;ldquo;what math do you actually use day to day?&amp;rdquo; and the answer is almost always &lt;strong>linear algebra&lt;/strong>. Calculus shows up in derivations; probability shows up in modeling; but the runtime of a real ML system is dominated by matrix-vector multiplies, decompositions, and projections. PyTorch&amp;rsquo;s &lt;code>Linear&lt;/code>, scikit-learn&amp;rsquo;s &lt;code>PCA&lt;/code>, Spark MLlib&amp;rsquo;s &lt;code>ALS&lt;/code>, and a Transformer&amp;rsquo;s attention head are all the same primitive in different costumes.&lt;/p>
&lt;p>This chapter covers the algorithms used in production ML systems — PCA, LDA, SVM with kernels, matrix factorization for recommenders, regularized linear regression, neural network layers, and attention — and explains the linear algebra behind each. We focus on intuition first, then geometry, and finally formulas.&lt;/p></description></item><item><title>Probability and Statistics (8): Bayesian Statistics — Priors, Posteriors, and Why Frequentists Argue</title><link>https://www.chenk.top/en/probability-statistics/08-bayesian-thinking/</link><pubDate>Fri, 30 Aug 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/probability-statistics/08-bayesian-thinking/</guid><description>&lt;p>Two statisticians walk into a bar. One says: &amp;ldquo;The probability of rain tomorrow is 30%.&amp;rdquo; The other replies: &amp;ldquo;Probability is a long-run frequency. Since tomorrow only happens once, that statement is meaningless.&amp;rdquo; The first one says: &amp;ldquo;It quantifies my uncertainty about a unique event.&amp;rdquo; They proceed to argue for the rest of the evening.&lt;/p>
&lt;p>This, roughly, is the Bayesian-frequentist debate. It&amp;rsquo;s not about who&amp;rsquo;s right — both frameworks are mathematically consistent. It&amp;rsquo;s about what &amp;ldquo;probability&amp;rdquo; means and how that interpretation shapes the tools you use. Having worked through six articles of largely frequentist reasoning, we now develop the Bayesian perspective: parameters are random, data update our beliefs, and uncertainty is quantified through distributions rather than confidence intervals.&lt;/p></description></item><item><title>PDE and ML (8): Reaction-Diffusion Systems and Graph Neural Networks</title><link>https://www.chenk.top/en/pde-ml/08-reaction-diffusion-systems/</link><pubDate>Wed, 14 Aug 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/pde-ml/08-reaction-diffusion-systems/</guid><description>&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/pde-ml/08-Reaction-Diffusion-Systems/illustration_1.png" alt="PDE and ML (8): Reaction-Diffusion Systems and Graph Neural Networks — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;p>Anyone who has trained a deep GNN has seen it collapse — past a dozen or so layers, every node&amp;rsquo;s embedding becomes nearly identical and the model turns to mush. There is a name for this — &lt;strong>over-smoothing&lt;/strong> — and the underlying math is surprisingly clean: &lt;strong>GNN message passing is essentially a diffusion equation on the graph&lt;/strong>, and diffusion&amp;rsquo;s long-time behavior is to flatten everything to a constant.&lt;/p></description></item><item><title>PDE and ML (7): Diffusion Models and Score Matching</title><link>https://www.chenk.top/en/pde-ml/07-diffusion-models/</link><pubDate>Tue, 30 Jul 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/pde-ml/07-diffusion-models/</guid><description>&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/pde-ml/07-Diffusion-Models/illustration_1.png" alt="PDE and ML (7): Diffusion Models and Score Matching — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;p>The output side of a diffusion model is familiar: a high-quality image. The training objective, on the other hand, looks counter-intuitive at first sight — &lt;strong>add noise to the data until it is fully Gaussian, then learn to denoise step by step&lt;/strong>. Why is this detour more effective than learning the data distribution directly?&lt;/p>
&lt;p>The answer is hidden in PDEs. The forward noising process is a &lt;strong>heat equation&lt;/strong> (or, more generally, a Fokker–Planck equation), and it admits a reverse-time version — provided we know the score (the gradient of the log-density) at every time. &lt;strong>Score matching&lt;/strong> is the standard way to learn that score. From this angle, DDPM, DDIM, and score-based SDEs are not three different algorithms but three discretizations of the same PDE story.&lt;/p></description></item><item><title>PDE and ML (6): Continuous Normalizing Flows and Neural ODE</title><link>https://www.chenk.top/en/pde-ml/06-continuous-normalizing-flows/</link><pubDate>Mon, 15 Jul 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/pde-ml/06-continuous-normalizing-flows/</guid><description>&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/pde-ml/06-Continuous-Normalizing-Flows/illustration_1.png" alt="PDE and ML (6): Continuous Normalizing Flows and Neural ODE — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;p>How do you turn an isotropic Gaussian into a photograph of a cat?&lt;/p>
&lt;p>Normalizing flows give the most direct answer: stack a sequence of invertible transformations and let them push the simple distribution into the complex one. The continuous version covered in this article, the continuous normalizing flow (CNF), takes that idea to the limit — let the step size go to zero and the discrete chain becomes an ODE. Invertibility is automatic, and the change of density is governed by the instantaneous change-of-variables formula.&lt;/p></description></item><item><title>PDE and ML (5): Symplectic Geometry and Structure-Preserving Networks</title><link>https://www.chenk.top/en/pde-ml/05-symplectic-geometry/</link><pubDate>Sun, 30 Jun 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/pde-ml/05-symplectic-geometry/</guid><description>&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/pde-ml/05-Symplectic-Geometry/illustration_1.png" alt="PDE and ML (5): Symplectic Geometry and Structure-Preserving Networks — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;p>A frictionless pendulum keeps swinging without ever winding down — energy is conserved. The Earth orbits the Sun for billions of years without flying off — angular momentum is conserved. Behind every &amp;ldquo;this quantity stays constant&amp;rdquo; lurks a piece of geometry called &lt;strong>symplectic structure&lt;/strong>.&lt;/p>
&lt;p>Train a vanilla Neural ODE on pendulum data: after a few hundred steps the energy drifts. The network can fit the short-term trajectory just fine; what it can&amp;rsquo;t fit is the long-time conservation law. &lt;strong>Structure-preserving networks&lt;/strong> (HNN, LNN, SympNet) take a different approach: bake the conservation law into the architecture so the network &lt;em>cannot&lt;/em> violate it.&lt;/p></description></item><item><title>PDE and ML (3): Variational Principles and Optimization</title><link>https://www.chenk.top/en/pde-ml/03-variational-principles/</link><pubDate>Fri, 31 May 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/pde-ml/03-variational-principles/</guid><description>&lt;p>What is the essence of neural-network training? When we run gradient descent in a high-dimensional parameter space, is there a deeper continuous-time dynamics at work? As the network width tends to infinity, does discrete parameter updating converge to some elegant partial differential equation? The answers live at the intersection of the calculus of variations, optimal transport, and PDE theory.&lt;/p>
&lt;p>The last decade of deep-learning success has rested mostly on engineering intuition. Recently, however, mathematicians have made a striking observation: &lt;strong>viewing a neural network as a particle system on the space of probability measures&lt;/strong>, and studying its evolution under Wasserstein geometry, exposes the global structure of training — convergence guarantees, the role of over-parameterization, the meaning of initialization. The tool that makes this visible is &lt;strong>the variational principle&lt;/strong> — from least action in physics, through the JKO scheme of modern optimal transport, to the mean-field limit of neural networks.&lt;/p></description></item></channel></rss>