Tags

ML

Jul 30, 2025 Standalone 26 min read

Reparameterization Trick & Gumbel-Softmax: A Deep Dive

Make sense of the reparameterization trick and Gumbel-Softmax: why gradients can flow through sampling, how temperature trades bias for variance, and the practical pitfalls of training discrete latent variables …

Jul 28, 2025 Standalone 20 min read

Low-Rank Matrix Approximation and the Pseudoinverse: From SVD to Regularization

From the least-squares view to the Moore-Penrose pseudoinverse, the four Penrose conditions, computation via SVD, truncated SVD, Tikhonov regularization, and modern applications from PCA to LoRA.

Jun 27, 2023 Standalone 24 min read

Variational Autoencoder (VAE): From Intuition to Implementation and Troubleshooting

Build a VAE from scratch in PyTorch. Covers the ELBO objective, reparameterization trick, posterior collapse fixes, beta-VAE, and a complete training pipeline.

Sep 29, 2022 Optimization Theory 22 min read

Optimization (11): Non-Convex Optimization and Saddle Escape

Why does SGD work for training neural networks despite the non-convex landscape? We prove perturbed GD escapes strict saddles in polynomial time, derive convergence under the Polyak-Lojasiewicz condition, and survey what …

Sep 27, 2022 Optimization Theory 20 min read

Optimization (10): Stochastic Optimization and Variance Reduction

Why does SGD work? We prove the O(1/sqrt(T)) convex rate and the O(1/(mu T)) strongly convex rate from the gradient noise budget. Then variance reduction: SVRG, SAGA, Katyusha — methods that get to the linear rate of …

Sep 26, 2022 Optimization Theory 22 min read

Optimization (9): Interior-Point Methods and Self-Concordant Barriers

How interior-point methods became the default solver for convex programming: replace inequalities with a logarithmic barrier, parametrize the central path, and apply Newton's method. Includes the self-concordance …

Sep 24, 2022 Optimization Theory 20 min read

Optimization (8): Lagrangian Duality and KKT Conditions

How constraints become prices: the Lagrangian, weak duality, Slater's condition for strong duality, the KKT system as necessary and sufficient optimality conditions, and why the SVM dual is much smaller than the SVM primal. …

Sep 22, 2022 Optimization Theory 24 min read

Optimization (7): Second-Order Methods

Second-order methods break the sqrt(kappa) barrier by using curvature. We prove Newton's quadratic local convergence, derive BFGS from a secant condition + low-rank update, walk through L-BFGS's two-loop recursion that …

Sep 21, 2022 Optimization Theory 32 min read

Optimization (6): Composite Optimization and Proximal Methods

A systematic walk through the proximal operator: convex-analysis basics, the Moreau envelope, closed-form proxes, and how they power ISTA, FISTA, ADMM, LASSO, and SVM in practice.

Sep 20, 2022 Optimization Theory 26 min read

Optimization (5): Acceleration Beyond Nesterov

What does it really mean for a first-order method to be optimal? We prove a tight lower bound matching Nesterov's rate, derive Polyak's Heavy-Ball method as the continuous-time limit, build a unified Lyapunov framework …

Sep 18, 2022 Optimization Theory 40 min read

Optimization (4): Learning Rate and Schedules

A practitioner's guide to the single most important hyperparameter: why too-large LR explodes, how warmup and schedules really work, the LR range test, the LR-batch-size-weight-decay coupling, and recent ideas like WSD, …

Sep 16, 2022 Optimization Theory 24 min read

Optimization (3): The Gradient Descent Family from SGD to AdamW

One article that traces the full lineage GD -> SGD -> Momentum -> NAG -> AdaGrad -> RMSProp -> Adam -> AdamW, then onwards to Lion / Sophia / Schedule-Free. Each step is framed by the specific failure of the previous …

Sep 15, 2022 Optimization Theory 28 min read

Optimization (2): Smoothness, Strong Convexity, and Nesterov Acceleration

Three concepts that demystify most of optimization: Lipschitz smoothness fixes the maximum step size, strong convexity sets the convergence rate and uniqueness of the minimizer, and Nesterov acceleration replaces kappa …

Sep 14, 2022 Optimization Theory 26 min read

Optimization (1): Convex Analysis Foundations

The geometric and analytic toolkit that unlocks the rest of the series: convex sets, convex functions, the conjugate (Fenchel) transform, subgradients, and the indicator/support function pair. Includes complete proofs of …

Dec 30, 2021 Kernel Methods 38 min read

Kernel Methods (8): Deep Kernel Learning vs Deep Learning — A Practitioner's Guide

Deep kernel learning combines neural feature extractors with kernel methods. Covers when to pick kernels over deep nets, a hyperparameter tuning playbook, common failure modes, and a final 5-step kernel decision flowchart.

Dec 24, 2021 Kernel Methods 52 min read

Kernel Methods (7): Large-Scale Kernels — Nystrom Approximation and Random Fourier Features

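Exact kernel methods cost O(n^3) time in the number of training points n. Nystrom approximation and Random Fourier Features pull that back to time linear in n (for a fixed number of landmarks or random features) without giving up the kernel trick's expressive power.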

Dec 19, 2021 Kernel Methods 34 min read

Kernel Methods (6): Gaussian Processes — When Kernels Meet Bayesian Inference

Gaussian Processes turn kernels into a Bayesian model — posterior with uncertainty, marginal likelihood for hyperparameters, and the kernel as a prior over functions.

Dec 14, 2021 Kernel Methods 44 min read

Kernel Methods (5): Kernel SVM, Kernel PCA, and Kernel Ridge Regression

The classic algorithms, kernelized — SVM's dual form, Kernel PCA's eigendecomposition in feature space, and Kernel Ridge's closed-form solution. With sklearn code and worked examples.

Dec 9, 2021 Kernel Methods 44 min read

Kernel Methods (4): Common Kernel Families — RBF, Matern, Polynomial, Periodic, and More

A tour of the kernels you'll actually use: RBF (Gaussian), polynomial, linear, Matern, periodic, sigmoid. When to pick which, hyperparameter intuition, and how kernels combine.

Dec 4, 2021 Kernel Methods 44 min read

Kernel Methods (3): RKHS — The Theoretical Soul of Kernel Methods

Reproducing Kernel Hilbert Space — the function space where kernel methods live. The reproducing property, the representer theorem, and why finite-data optimization works in infinite dimensions.

Nov 29, 2021 Kernel Methods 76 min read

Kernel Methods (2): Mathematical Foundations — Positive-Definite Kernels and Mercer's Theorem

What makes a function a valid kernel? Positive-definiteness, the Gram matrix test, and Mercer's theorem — the spectral decomposition that justifies the kernel trick.

Nov 24, 2021 Kernel Methods 66 min read

Kernel Methods (1): Why We Need Them — Hitting the Ceiling of Linear Algorithms

Linear algorithms can't capture non-linear patterns. The kernel trick lets you keep the linear algorithm's elegance and still model non-linear relationships, without ever computing the high-dimensional feature map explicitly. Part 1 of an …