Machine Learning on Chen Kai Blog

Alibaba Cloud Full Stack (11): PAI — The ML Platform

Fri, 08 May 2026 09:00:00 +0000

Training a model on a single GPU is fun. Deploying it to handle 1,000 requests per second without failing is what separates experiments from products. PAI handles both.

PAI (Platform for AI) is Alibaba Cloud’s managed ML platform. It’s not just one product; it’s five products in a trench coat, sharing a console. These include a notebook environment for exploration, a distributed training service for scale, a model serving platform for production, a visual pipeline designer for those who prefer dragging boxes, and a model gallery for one-click deployment of open-source models. After eighteen months of running real LLM workloads on it, I can say that the individual components range from excellent (EAS) to good enough (Designer). The whole platform is genuinely greater than the sum of its parts once you understand how they connect.

Aliyun PAI (2): PAI-DSW — Notebooks That Don't Eat Your Weights

Fri, 06 Mar 2026 09:00:00 +0000

Every time I onboard a new ML engineer to PAI, the first day looks the same. They start a DSW instance, pip install their world, train for an hour, restart the kernel for some reason, and then ask me where their model file went. The honest answer, “in /root on a node that no longer exists”, is the kind of lesson you only need to learn once. This article is the version of that lesson you read in advance.

Aliyun PAI (1): Platform Overview and the Product Family Map

Thu, 05 Mar 2026 09:00:00 +0000

If your team trains or serves models on Alibaba Cloud, you’ll eventually use the PAI console. PAI is the umbrella; underneath it are the actual workhorses: a notebook product, a distributed training service, a model-serving service, and a few GUI/quick-deploy layers. I spent about eighteen months running real LLM workloads on it for an AI marketing platform, and this series is the field guide I wish I had before deploying my first endpoint.

ML Math Derivations (20): Regularization and Model Selection

Sun, 08 Feb 2026 09:00:00 +0000

What You Will Learn

A 100-million-parameter network trained on 50,000 images should overfit catastrophically. Modern deep networks generalise anyway. Why? Two ingredients: regularisation (techniques that constrain capacity) and generalisation theory (mathematics that says when learning works at all). This article is the closing chapter of the series, and we use it to gather every tool we have built — least squares, MAP estimation, optimisation, EM, neural networks — and turn them on the deepest open question in the field: why does learning generalise?

ML Math Derivations (19): Neural Networks and Backpropagation

Sat, 07 Feb 2026 09:00:00 +0000

Hook. In 1969 Minsky and Papert proved that a single perceptron could not learn XOR, and connectionist research went into a fifteen-year freeze. The thaw came when Rumelhart, Hinton and Williams realised that stacking perceptrons makes the problem disappear, and that the same chain rule everyone learns in calculus, applied carefully, computes every gradient in a multilayer network for roughly the cost of one extra pass through the network. That algorithm is backpropagation. Every gradient in every Transformer, every diffusion model, every GPT trained today still runs on it.

ML Math Derivations (18): Clustering Algorithms

Fri, 06 Feb 2026 09:00:00 +0000

What You Will Learn

A million customer records arrive with no labels. Can you discover meaningful groups automatically? That is clustering, the most fundamental unsupervised learning task. Unlike classification, clustering forces you to first answer a slippery question: what does “similar” even mean? Every clustering algorithm is, at heart, a different answer to that question — a different geometric, probabilistic, or graph-theoretic prior on what a “group” is.

ML Math Derivations (17): Dimensionality Reduction and PCA

Thu, 05 Feb 2026 09:00:00 +0000

What You Will Learn

Feed a clustering algorithm $10{,}000$-dimensional data and it will most likely fail, not because the algorithm is broken, but because high-dimensional space is a hostile environment for distance-based learning. Volume concentrates in thin shells, the ratio of nearest- to farthest-neighbour distances tends to $1$, and “closeness” stops carrying information. Dimensionality reduction is the response: project the data into a lower-dimensional space while keeping the structure that actually matters.
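The distance-concentration claim can be checked numerically. The sketch below is only an illustration under assumed conditions (points drawn uniformly from the unit cube, Euclidean distance, a single random query point); the function name and defaults are hypothetical and not taken from the article.

```python
import numpy as np

def nearest_farthest_ratio(dim, n_points=200, seed=0):
    """Ratio of the nearest to the farthest Euclidean distance from one query point."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, dim))  # assumed data model: uniform in the unit cube
    query = rng.uniform(size=dim)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.min() / dists.max()

if __name__ == "__main__":
    for dim in (2, 100, 10_000):
        print(dim, round(nearest_farthest_ratio(dim), 3))
```

In low dimension the ratio is small (some neighbours are much closer than others); as the dimension grows it approaches 1, which is the sense in which “closeness” stops carrying information.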

ML Math Derivations (16): Conditional Random Fields

Wed, 04 Feb 2026 09:00:00 +0000

What You Will Learn

Named entity recognition, POS tagging, information extraction: every one of these tasks asks you to label each element of a sequence. HMMs (Part 15) attack this problem generatively by modelling the joint distribution $P(\mathbf{X},\mathbf{Y})$, but to make the joint factorise they pay a steep price: each observation is assumed independent of everything except its own hidden label. In real text, whether “bank” is a noun or a verb depends on the preceding and following words, the suffix, capitalization, and dictionary lookups, all taken together.

ML Math Derivations (15): Hidden Markov Models

Tue, 03 Feb 2026 09:00:00 +0000

You hear footsteps behind you in the fog. You can’t see the walker, only the sounds. From the rhythm and pitch — short, soft, hurried — can you guess whether they are walking, running, or limping? And if you observed an entire sequence, which gait sequence is most likely? How likely is any sequence of sounds under your model of how walking works?

These are the three problems of HMMs, and the surprise is that all three reduce to one trick: write the joint $P(\mathbf{O}, \mathbf{I})$ as a product of local factors along time, then share sub-computations across time with dynamic programming. Brute force costs $O(N^T)$. Forward-Backward and Viterbi cost $O(N^2 T)$, as does each iteration of Baum-Welch. The exponent collapses because the Markov assumption makes the future conditionally independent of the past given the present.
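To make the cost gap concrete, here is a minimal sketch of the forward recursion next to a brute-force sum over all hidden paths. The toy parameters `pi`, `A`, `B` and the observation sequence `obs` are made up for illustration; the convention assumed here is `A[i, j] = P(state j at t+1 | state i at t)` and `B[i, k] = P(symbol k | state i)`.

```python
import numpy as np
from itertools import product

def forward_likelihood(pi, A, B, obs):
    """P(O) via the forward recursion: O(N^2 T) work."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        # One matrix-vector product per time step shares work across all paths.
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

def brute_force_likelihood(pi, A, B, obs):
    """P(O) by enumerating all N^T hidden paths."""
    total = 0.0
    for path in product(range(len(pi)), repeat=len(obs)):
        p = pi[path[0]] * B[path[0], obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
        total += p
    return total

# Hypothetical 3-state, 2-symbol model.
pi = np.array([0.6, 0.3, 0.1])
A = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])
B = np.array([[0.9, 0.1],
              [0.4, 0.6],
              [0.2, 0.8]])
obs = [0, 1, 1, 0]

if __name__ == "__main__":
    print(forward_likelihood(pi, A, B, obs), brute_force_likelihood(pi, A, B, obs))
```

Both functions return the same number; only the forward version stays tractable as the sequence grows.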

ML Math Derivations (14): Variational Inference and Variational EM

Mon, 02 Feb 2026 09:00:00 +0000

When the posterior $p(\mathbf{z}\mid\mathbf{x})$ is intractable, you have two roads. Sampling (MCMC) walks a Markov chain whose stationary distribution is the posterior — eventually exact, but slow and hard to diagnose. Variational inference (VI) instead picks a simple family $\mathcal{Q}$ of distributions and finds the member $q^\star\in\mathcal{Q}$ that lies closest to the true posterior. Inference becomes optimization, and the same machinery that fits a neural network now fits a Bayesian model.

This post derives VI from a single identity, builds the mean-field algorithm and CAVI from that identity, connects EM and variational EM as special cases, and ends with the reparameterization trick that turns the ELBO into a stochastic objective compatible with autodiff — the engine inside every VAE.
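The excerpt does not say which identity the post starts from. A natural candidate, stated here as an assumption, is the standard decomposition of the log-evidence into the ELBO plus a KL term, valid for any distribution $q(\mathbf{z})$:

$$
\log p(\mathbf{x})
= \underbrace{\mathbb{E}_{q(\mathbf{z})}\big[\log p(\mathbf{x},\mathbf{z}) - \log q(\mathbf{z})\big]}_{\mathrm{ELBO}(q)}
+ \mathrm{KL}\big(q(\mathbf{z})\,\|\,p(\mathbf{z}\mid\mathbf{x})\big)
$$

Because the left-hand side does not depend on $q$ and the KL term is non-negative, maximising the ELBO over $q\in\mathcal{Q}$ is the same as minimising the KL divergence to the true posterior.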

ML Math Derivations (13): EM Algorithm and GMM

Sun, 01 Feb 2026 09:00:00 +0000

When data has hidden structure (an unobserved cluster label, a missing feature, an unseen topic), maximum likelihood becomes hard. The log-likelihood contains the log of a sum, so its maximiser has no closed form, and gradient methods get entangled with the latent variables. The EM algorithm sidesteps the difficulty with a deceptively simple idea: alternate between inferring the hidden variables under their posterior (E-step) and fitting the parameters as if those inferences were true (M-step). Each iteration is guaranteed never to decrease the likelihood. This post derives EM from first principles, proves the monotone-ascent property using Jensen’s inequality, and explores its most famous application: Gaussian Mixture Models (GMM), the soft, elliptical generalization of K-means.
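The E-step/M-step alternation can be sketched for the simplest case, a two-component mixture in one dimension. This is an illustrative sketch, not the post's implementation: the function name, the initialisation, and the fixed iteration count are all assumptions.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50, seed=0):
    """EM for a two-component 1-D Gaussian mixture; returns weights, means, variances, log-likelihood trace."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=2, replace=False)  # assumed init: two random data points
    var = np.full(2, x.var())
    w = np.full(2, 0.5)
    log_liks = []
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        log_liks.append(np.log(dens.sum(axis=1)).sum())
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood updates using those responsibilities.
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var, np.array(log_liks)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 0.5, 200)])
    w, mu, var, ll = em_gmm_1d(data)
    print(w, mu, var, ll[0], ll[-1])
```

The recorded log-likelihood trace is non-decreasing from one iteration to the next, which is the monotone-ascent property in action.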

ML Math Derivations (12): XGBoost and LightGBM

Sat, 31 Jan 2026 09:00:00 +0000

XGBoost and LightGBM are the two libraries that quietly win most tabular-data battles: on Kaggle leaderboards, in fraud-detection pipelines, in ad ranking, in churn models. They share the same backbone (gradient-boosted trees, Part 11) but make very different engineering bets:

XGBoost sharpens the math: it brings the second derivative of the loss into the objective, regularises the tree itself, and turns split selection into a closed-form score.
LightGBM sharpens the systems: it bins features into a small histogram, grows trees leaf-by-leaf, throws away uninformative samples (GOSS) and bundles mutually exclusive sparse features (EFB).

The result is two tools that look interchangeable from the API but behave very differently when $$N$$ or $$d$$ becomes large. This post derives every formula behind those choices so you can read a tuning guide and know why each knob exists.
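As a preview of the closed-form split score mentioned above, here is the standard second-order result in the notation of the XGBoost paper. The symbols are assumptions relative to this excerpt: $G$ and $H$ are the sums of first and second derivatives of the loss over the samples in a node, $\lambda$ is the L2 penalty on leaf weights, and $\gamma$ is the per-leaf complexity cost.

$$
w^\star = -\frac{G}{H+\lambda},
\qquad
\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma
$$

A split is worth making only when the gain is positive, so $\gamma$ acts as a built-in pruning threshold and $\lambda$ shrinks the leaf weights toward zero.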

ML Math Derivations (11): Ensemble Learning

Fri, 30 Jan 2026 09:00:00 +0000

Why do mediocre classifiers in a committee outperform a single brilliant one? The answer is straightforward: averaging reduces variance, sequential reweighting reduces bias, and a bit of randomization breaks the correlation that would otherwise negate these benefits. This post delves into the math behind this — bias-variance decomposition, bootstrap aggregating, AdaBoost as forward stagewise minimization of exponential loss, and gradient boosting as gradient descent in function space.

By the end, you should be able to look at any ensemble method and say what it reduces, why it works, and when it fails.
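The variance claim has a one-line form. Assume $M$ base learners whose predictions each have variance $\sigma^2$ and pairwise correlation $\rho$ (these symbols are introduced here for illustration):

$$
\operatorname{Var}\!\left(\frac{1}{M}\sum_{m=1}^{M} f_m(\mathbf{x})\right) = \rho\,\sigma^2 + \frac{1-\rho}{M}\,\sigma^2
$$

The second term vanishes as $M$ grows, but the first does not, which is why randomization that lowers $\rho$ matters as much as adding more learners.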

ML Math Derivations (10): Semi-Naive Bayes and Bayesian Networks

Thu, 29 Jan 2026 09:00:00 +0000

Hook. Naive Bayes assumes every feature is conditionally independent given the class. It is a convenient lie — one that lets us train in a single pass over the data, but one that classifiers based on tree structures and small graphs can systematically beat by a few accuracy points on virtually every UCI benchmark. This part walks the spectrum from “no dependencies” (Naive Bayes) to “all dependencies” (full joint), showing the three sweet spots that practitioners actually use: SPODE, TAN and AODE. The same factorisation idea, taken to its general form, is the Bayesian network.

ML Math Derivations (9): Naive Bayes

Wed, 28 Jan 2026 09:00:00 +0000

Hook: A spam filter that trains in milliseconds, scales to a million features, has no hyperparameters worth tuning, and still beats much fancier models on short-text problems. Naive Bayes pulls this off by making one outrageous assumption — every feature is independent given the class — and refusing to apologise for it. The assumption is wrong on essentially every real dataset, yet the classifier works. Understanding why is a tour through generative modelling, MAP estimation, Dirichlet priors, and the bias–variance tradeoff. This article walks the entire path.

ML Math Derivations (8): Support Vector Machines

Tue, 27 Jan 2026 09:00:00 +0000

Hook. You have two clouds of points and infinitely many lines that separate them. Which line is “best”? SVM gives a startlingly geometric answer: the line that sits in the middle of the widest empty corridor between the two classes. Push that single idea through Lagrangian duality and it produces a sparse model (only the points on the corridor wall matter), a quadratic program with a global optimum, and — almost as a free gift — the kernel trick that lets the same linear machinery carve curved boundaries in infinite-dimensional spaces.

ML Math Derivations (7): Decision Trees

Mon, 26 Jan 2026 09:00:00 +0000

Hook. A decision tree mimics how humans actually decide things: ask a question, branch on the answer, ask the next question. The math under that intuition is surprisingly rich — entropy from information theory tells us which question to ask first, the Gini index gives a cheaper proxy that lands on essentially the same trees, and cost-complexity pruning gives a principled way to stop the tree from memorising noise. Almost every modern boosted ensemble (XGBoost, LightGBM, CatBoost) is just a clever sum of these objects, so getting the foundations right pays off many times over.
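A minimal sketch of the two impurity measures named above, written for a vector of class proportions. The function names are illustrative and entropy is measured in bits.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a class-probability vector (zero entries are skipped)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def gini(p):
    """Gini impurity: the chance that two independent draws have different classes."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - (p ** 2).sum())

if __name__ == "__main__":
    for probs in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
        print(probs, entropy(probs), gini(probs))
```

Both measures are zero for a pure node and largest for a uniform class mix, which is why they tend to rank candidate splits similarly; Gini avoids the logarithm.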

ML Math Derivations (6): Logistic Regression and Classification

Sun, 25 Jan 2026 09:00:00 +0000

Hook. Linear regression maps inputs to any real number, but what if the output has to be a probability between 0 and 1? Logistic regression solves this with one elegant trick: a sigmoid squashing function. Despite its name, logistic regression is a classification algorithm, and its math (a linear score, a squashing nonlinearity, a cross-entropy loss) underpins the output layer of most modern neural classifiers.

What You Will Learn

Why sigmoid is the natural way to turn a real-valued score into a probability, and why its derivative is so clean.
How cross-entropy loss falls out of maximum likelihood estimation in two lines.
Why cross-entropy beats MSE for classification — a vanishing-gradient argument made visible.
The full gradient and Hessian for both binary and multi-class (softmax) cases, and why the loss is convex.
L1, L2 and elastic-net regularization, and the Bayesian priors hiding behind them.
Decision-boundary geometry and the threshold-free metrics (ROC / PR / AUC) that you actually need under class imbalance.
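The “clean derivative” in the first item is the identity $\sigma'(z) = \sigma(z)\,(1-\sigma(z))$. A short sketch that checks it against a central finite difference (the function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Logistic function, mapping any real score to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Analytic derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

if __name__ == "__main__":
    z = np.linspace(-4.0, 4.0, 9)
    eps = 1e-6
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
    print(np.max(np.abs(numeric - sigmoid_grad(z))))
```

The derivative is expressed entirely through the forward value, so no extra exponential is needed when computing gradients.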

Prerequisites

Calculus: chain rule, partial derivatives.
Linear algebra: matrix multiplication, transpose.
Probability: Bernoulli and categorical distributions, likelihood.
Familiarity with Part 5: Linear Regression.

From Linear Models to Probabilistic Classification

The Problem with Raw Linear Output

Linear regression gives us $\hat y = \mathbf{w}^\top \mathbf{x}$ , which is unbounded. For classification, two things go wrong:

ML Math Derivations (5): Linear Regression

Sat, 24 Jan 2026 09:00:00 +0000

Hook. In 1886 Francis Galton noticed something strange about heredity: children of unusually tall (or short) parents tended to be closer to the average than their parents were. He called this drift toward the mean regression, and the name stuck. The statistical curiosity grew up into the most consequential model in machine learning — not because linear regression is powerful on its own, but because almost every other algorithm (logistic regression, neural networks, kernel methods) is some twist on the same idea: fit a line, but in the right space.

ML Math Derivations (4): Convex Optimization Theory

Fri, 23 Jan 2026 09:00:00 +0000

What You Will Learn

In 1947, George Dantzig proposed the simplex method for linear programming, and a working theory of optimization was born. Eight decades later, optimization has become the engine of machine learning: every model you train, from a one-line linear regression to a 70B-parameter language model, is the answer to some optimization problem.

ML Math Derivations (3): Probability Theory and Statistical Inference

Thu, 22 Jan 2026 09:00:00 +0000

What You Will Learn

In 1912, Ronald Fisher introduced maximum likelihood estimation in a short paper that quietly redefined statistics. His insight was almost embarrassingly simple: if a parameter setting makes the observed data extremely likely, it is probably correct. Almost every modern learning algorithm — from logistic regression to large language models — descends from this idea.

ML Math Derivations (2): Linear Algebra and Matrix Theory

Wed, 21 Jan 2026 09:00:00 +0000

Why this chapter, and what’s different

If you have already worked through a standard linear-algebra course you have seen most of these objects. This chapter is not that course. It is the ML practitioner’s slice of linear algebra: the half-dozen ideas that actually appear when you implement gradient descent, run PCA, train a neural net, or read a paper.

ML Math Derivations (1): Introduction and Mathematical Foundations

Tue, 20 Jan 2026 09:00:00 +0000

What this chapter does

In 2005 Google Research showed, on a public benchmark, that a statistical translation model trained on raw bilingual text could outperform decades of carefully engineered linguistic rules. The conclusion was uncomfortable for the experts of the day, but mathematically liberating: a system that has never been told the rules of a language can still recover them, given enough examples. Why?

Symplectic Geometry and Structure-Preserving Neural Networks

Mon, 28 Jul 2025 09:00:00 +0000

Train a vanilla feedforward network to predict a one-dimensional harmonic oscillator. Validate it on the next ten time steps — the error is fine. Now roll it out for a thousand steps. The orbit no longer closes, the energy creeps upward, and what should be periodic motion turns into a slow spiral. The network learned to fit data points but never learned the physics. Structure-preserving networks fix this by incorporating geometric invariants — energy conservation, the symplectic 2-form, and the Euler-Lagrange equations — directly into the architecture, ensuring the learned model cannot violate them no matter how long you integrate.
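The same energy-drift pattern appears with classical integrators, with no neural network involved, and that makes for a compact illustration. The sketch below swaps the learned model for two hand-written steppers on the unit harmonic oscillator $H = \tfrac12(q^2 + p^2)$: explicit Euler, which ignores the geometry, and symplectic Euler, which preserves it. The step size, step count, and function names are assumptions chosen for the demo.

```python
import numpy as np

def explicit_euler(q, p, dt):
    """Non-symplectic step: both updates use the old state."""
    return q + dt * p, p - dt * q

def symplectic_euler(q, p, dt):
    """Symplectic step: update the momentum first, then the position with the new momentum."""
    p_new = p - dt * q
    return q + dt * p_new, p_new

def rollout_energy(step, n_steps=1000, dt=0.1):
    """Energy 0.5 * (q^2 + p^2) along a rollout starting from q = 1, p = 0."""
    q, p = 1.0, 0.0
    energies = np.empty(n_steps + 1)
    energies[0] = 0.5 * (q * q + p * p)
    for i in range(n_steps):
        q, p = step(q, p, dt)
        energies[i + 1] = 0.5 * (q * q + p * p)
    return energies

if __name__ == "__main__":
    print("explicit Euler final energy:  ", rollout_energy(explicit_euler)[-1])
    print("symplectic Euler final energy:", rollout_energy(symplectic_euler)[-1])
```

Explicit Euler multiplies the energy by $1 + \Delta t^2$ at every step, so the orbit spirals outward; the symplectic variant keeps the energy within a small band around its initial value for as long as you integrate.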

Transfer Learning (1): Fundamentals and Core Concepts

Thu, 01 May 2025 09:00:00 +0000

You spent two weeks training an ImageNet classifier on a rack of GPUs. On Monday morning, your team lead asks for a chest X-ray pneumonia model, and the entire labeled dataset is two hundred images. Do you book another two weeks of GPU time and start from scratch?

Of course not. You use what the ImageNet model already knows about edges, textures, and shapes, swap out the last layer, and fine-tune on the X-rays. Two hours later, you have a model that beats anything you could have trained from random weights with so little data. That’s transfer learning, and it’s why most real-world deep learning projects ship in days instead of months.

Essence of Linear Algebra (15): Linear Algebra in Machine Learning

Wed, 09 Apr 2025 09:00:00 +0000

Ask any senior ML engineer “what math do you actually use day to day?” and the answer is almost always linear algebra. Calculus shows up in derivations; probability shows up in modeling; but the runtime of a real ML system is dominated by matrix-vector multiplies, decompositions, and projections. PyTorch’s Linear, scikit-learn’s PCA, Spark MLlib’s ALS, and a Transformer’s attention head are all the same primitive in different costumes.

This chapter covers the algorithms used in production ML systems — PCA, LDA, SVM with kernels, matrix factorization for recommenders, regularized linear regression, neural network layers, and attention — and explains the linear algebra behind each. We focus on intuition first, then geometry, and finally formulas.

Probability and Statistics (8): Bayesian Statistics — Priors, Posteriors, and Why Frequentists Argue

Fri, 30 Aug 2024 09:00:00 +0000

Two statisticians walk into a bar. One says: “The probability of rain tomorrow is 30%.” The other replies: “Probability is a long-run frequency. Since tomorrow only happens once, that statement is meaningless.” The first one says: “It quantifies my uncertainty about a unique event.” They proceed to argue for the rest of the evening.

This, roughly, is the Bayesian-frequentist debate. It’s not about who’s right — both frameworks are mathematically consistent. It’s about what “probability” means and how that interpretation shapes the tools you use. Having worked through six articles of largely frequentist reasoning, we now develop the Bayesian perspective: parameters are random, data update our beliefs, and uncertainty is quantified through distributions rather than confidence intervals.

PDE and ML (8): Reaction-Diffusion Systems and Graph Neural Networks

Wed, 14 Aug 2024 09:00:00 +0000

Anyone who has trained a deep GNN has seen it collapse: past a dozen or so layers, every node’s embedding becomes nearly identical and the model turns to mush. The phenomenon is called over-smoothing, and the underlying math is surprisingly clean: GNN message passing is essentially a diffusion equation on the graph, and diffusion’s long-time behavior is to flatten everything to a constant.
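The flattening can be reproduced with the linear part of message passing alone. The sketch below is a deliberately stripped-down assumption: no learned weights, no nonlinearity, a ring graph with self-loops, and mean aggregation, so each “layer” is one multiplication by a row-stochastic matrix. All names are illustrative.

```python
import numpy as np

def ring_adjacency(n):
    """Adjacency matrix of an undirected ring with n nodes."""
    adj = np.zeros((n, n))
    idx = np.arange(n)
    adj[idx, (idx + 1) % n] = 1.0
    adj[(idx + 1) % n, idx] = 1.0
    return adj

def random_walk_matrix(adj):
    """Mean-aggregation operator: add self-loops, then normalise each row to sum to 1."""
    adj = adj + np.eye(len(adj))
    return adj / adj.sum(axis=1, keepdims=True)

def propagate(x, p, n_layers):
    """Apply n_layers rounds of weight-free message passing."""
    for _ in range(n_layers):
        x = p @ x
    return x

def feature_spread(x):
    """Largest deviation of any node feature from the per-feature mean over nodes."""
    return float(np.abs(x - x.mean(axis=0)).max())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = rng.normal(size=(8, 4))
    p = random_walk_matrix(ring_adjacency(8))
    for layers in (0, 2, 12, 50):
        print(layers, feature_spread(propagate(x0, p, layers)))
```

The spread shrinks geometrically with depth, at a rate set by the second-largest eigenvalue of the propagation matrix, which is the discrete analogue of heat diffusing to a constant.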

PDE and ML (7): Diffusion Models and Score Matching

Tue, 30 Jul 2024 09:00:00 +0000

The output side of a diffusion model is familiar: a high-quality image. The training objective, on the other hand, looks counter-intuitive at first sight — add noise to the data until it is fully Gaussian, then learn to denoise step by step. Why is this detour more effective than learning the data distribution directly?

The answer is hidden in PDEs. The forward noising process is a heat equation (or, more generally, a Fokker–Planck equation), and it admits a reverse-time version — provided we know the score (the gradient of the log-density) at every time. Score matching is the standard way to learn that score. From this angle, DDPM, DDIM, and score-based SDEs are not three different algorithms but three discretizations of the same PDE story.

PDE and ML (6): Continuous Normalizing Flows and Neural ODE

Mon, 15 Jul 2024 09:00:00 +0000

How do you turn an isotropic Gaussian into a photograph of a cat?

Normalizing flows give the most direct answer: stack a sequence of invertible transformations and let them push the simple distribution into the complex one. The continuous version covered in this article (CNF) takes that idea to the limit: let the step size go to zero and the discrete chain becomes an ODE. Invertibility is automatic, and the change of density is governed by the instantaneous change-of-variables formula.
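For reference, the instantaneous change-of-variables formula in its usual Neural ODE form, with $f$ denoting the learned vector field (notation assumed here, not quoted from the article):

$$
\frac{d\mathbf{z}(t)}{dt} = f(\mathbf{z}(t), t),
\qquad
\frac{d \log p(\mathbf{z}(t))}{dt} = -\operatorname{tr}\!\left(\frac{\partial f}{\partial \mathbf{z}(t)}\right)
$$

The log-determinant of a discrete flow becomes a trace in the continuous limit, which is cheaper to compute or estimate than a full Jacobian determinant.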

PDE and ML (5): Symplectic Geometry and Structure-Preserving Networks

Sun, 30 Jun 2024 09:00:00 +0000

A pendulum keeps swinging for a very long time without slowly winding down — energy is conserved. The Earth orbits the Sun for billions of years without flying off — angular momentum is conserved. Behind every “this quantity stays constant” lurks a piece of geometry called symplectic structure.

Train a vanilla Neural ODE on pendulum data: after a few hundred steps the energy drifts. The network can fit the short-term trajectory just fine; what it can’t fit is the long-time conservation law. Structure-preserving networks (HNN, LNN, SympNet) take a different approach: bake the conservation law into the architecture so the network cannot violate it.

PDE and ML (3): Variational Principles and Optimization

Fri, 31 May 2024 09:00:00 +0000

What is the essence of neural-network training? When we run gradient descent in a high-dimensional parameter space, is there a deeper continuous-time dynamics at work? As the network width tends to infinity, does discrete parameter updating converge to some elegant partial differential equation? The answers live at the intersection of the calculus of variations, optimal transport, and PDE theory.

The last decade of deep-learning success has rested mostly on engineering intuition. Recently, however, mathematicians have made a striking observation: viewing a neural network as a particle system on the space of probability measures, and studying its evolution under Wasserstein geometry, exposes the global structure of training — convergence guarantees, the role of over-parameterization, the meaning of initialization. The tool that makes this visible is the variational principle — from least action in physics, through the JKO scheme of modern optimal transport, to the mean-field limit of neural networks.