ML Math Derivations

Deriving the algorithms, with no hand-waving.

20 articles

  1. 01

    ML Math Derivations (1): Introduction and Mathematical Foundations

    Why can machines learn from data at all? This first chapter builds the mathematical theory of learning from first …

    42 min
  2. 02

    ML Math Derivations (2): Linear Algebra and Matrix Theory

    The language of machine learning is linear algebra. This article derives vector spaces, eigendecomposition, SVD, and …

    30 min
  3. 03

    ML Math Derivations (3): Probability Theory and Statistical Inference

    Machine learning is uncertainty modeling. This article derives probability spaces, common distributions, MLE, Bayesian …

    28 min
  4. 04

    ML Math Derivations (4): Convex Optimization Theory

    Nearly every ML algorithm is an optimization problem. This article derives convex sets, convex functions, gradient …

    44 min
  5. 05

    ML Math Derivations (5): Linear Regression

    A complete derivation of linear regression from three perspectives -- algebra (the normal equation), geometry …

    32 min
  6. 06

    ML Math Derivations (6): Logistic Regression and Classification

    Complete derivation of logistic regression from sigmoid to softmax, cross-entropy loss, gradient computation, …

    32 min
  7. 07

    ML Math Derivations (7): Decision Trees

    From information entropy to the Gini index, from ID3 to CART -- a complete derivation of decision-tree mathematics: split …

    38 min
  8. 08

    ML Math Derivations (8): Support Vector Machines

    Complete SVM derivation from maximum margin to Lagrangian duality, KKT conditions, soft margin, kernel trick, and SMO …

    28 min
  9. 09

    ML Math Derivations (9): Naive Bayes

    Rigorous derivation of Naive Bayes from Bayes' theorem through conditional independence, parameter estimation, Laplace …

    34 min
  10. 10

    ML Math Derivations (10): Semi-Naive Bayes and Bayesian Networks

    From SPODE, TAN and AODE to full Bayesian networks: how relaxing the conditional-independence assumption -- through …

    28 min
  11. 11

    ML Math Derivations (11): Ensemble Learning

    Derive why combining weak learners produces strong ones. Covers bias-variance decomposition, Bagging/Random Forest …

    36 min
  12. 12

    ML Math Derivations (12): XGBoost and LightGBM

    Derive XGBoost's second-order Taylor expansion, regularized objective and split-gain formula, then explore LightGBM's …

    28 min
  13. 13

    ML Math Derivations (13): EM Algorithm and GMM

    Derive the EM algorithm from Jensen's inequality and the ELBO, prove its monotone-ascent guarantee, and apply it to …

    24 min
  14. 14

    ML Math Derivations (14): Variational Inference and Variational EM

    A first-principles derivation of variational inference. From the ELBO identity and the mean-field assumption to the CAVI …

    28 min
  15. 15

    ML Math Derivations (15): Hidden Markov Models

    Derive the three classical HMM algorithms from one principle (factorizing the joint, then sharing sub-computations …

    24 min
  16. 16

    ML Math Derivations (16): Conditional Random Fields

    Why do CRFs outperform HMMs on sequence labeling? This article derives the linear-chain CRF from the ground up -- potential …

    26 min
  17. 17

    ML Math Derivations (17): Dimensionality Reduction and PCA

    High-dimensional spaces are hostile to distance-based learning. This article derives PCA from two equivalent angles (max …

    26 min
  18. 18

    ML Math Derivations (18): Clustering Algorithms

    How do you find groups in unlabeled data? This article derives K-means (Lloyd + K-means++), hierarchical, DBSCAN, …

    32 min
  19. 19

    ML Math Derivations (19): Neural Networks and Backpropagation

    How does a neural network learn? This article derives forward propagation, the chain rule mechanics of backpropagation, …

    32 min
  20. 20

    ML Math Derivations (20): Regularization and Model Selection

    The series finale: from the bias-variance decomposition to L1/L2 geometry, dropout as a sub-network sampler, k-fold CV, …

    28 min