ML
Reparameterization Trick & Gumbel-Softmax: A Deep Dive
Make sense of the reparameterization trick and Gumbel-Softmax: why gradients can flow through sampling, how temperature trades bias for variance, and the practical pitfalls of training discrete latent variables …
Low-Rank Matrix Approximation and the Pseudoinverse: From SVD to Regularization
From the least-squares view to the Moore-Penrose pseudoinverse, the four Penrose conditions, computation via SVD, truncated SVD, Tikhonov regularization, and modern applications from PCA to LoRA.
Variational Autoencoder (VAE): From Intuition to Implementation and Troubleshooting
Build a VAE from scratch in PyTorch. Covers the ELBO objective, reparameterization trick, posterior collapse fixes, beta-VAE, and a complete training pipeline.
Optimization (11): Non-Convex Optimization and Saddle Escape
Why does SGD work for training neural networks despite the non-convex landscape? We prove perturbed GD escapes strict saddles in polynomial time, derive convergence under the Polyak-Lojasiewicz condition, and survey what …
Optimization (10): Stochastic Optimization and Variance Reduction
Why does SGD work? We prove the O(1/sqrt(T)) convex rate and the O(1/(mu T)) strongly convex rate from the gradient noise budget. Then variance reduction: SVRG, SAGA, Katyusha — methods that get to the linear rate of …
Optimization (9): Interior-Point Methods and Self-Concordant Barriers
How interior-point methods became the default solver for convex programming: replace inequalities with a logarithmic barrier, parametrize the central path, and apply Newton's method. Includes the self-concordance …
Optimization (8): Lagrangian Duality and KKT Conditions
How constraints become prices: the Lagrangian, weak duality, Slater's condition for strong duality, the KKT system as necessary and sufficient optimality, and why the SVM dual is much smaller than the SVM primal. …
Optimization (7): Second-Order Methods
Second-order methods break the sqrt(kappa) barrier by using curvature. We prove Newton's quadratic local convergence, derive BFGS from a secant condition + low-rank update, walk through L-BFGS's two-loop recursion that …
Optimization (6): Composite Optimization and Proximal Methods
A systematic walk through the proximal operator: convex-analysis basics, the Moreau envelope, closed-form proxes, and how they power ISTA, FISTA, ADMM, LASSO, and SVM in practice.
Optimization (5): Acceleration Beyond Nesterov
What does it really mean for a first-order method to be optimal? We prove a tight lower bound matching Nesterov's rate, derive Polyak's Heavy-Ball method as the continuous-time limit, build a unified Lyapunov framework …
Optimization (4): Learning Rate and Schedules
A practitioner's guide to the single most important hyperparameter: why too-large LR explodes, how warmup and schedules really work, the LR range test, the LR-batch-size-weight-decay coupling, and recent ideas like WSD, …
Optimization (3): The Gradient Descent Family from SGD to AdamW
One article that traces the full lineage GD -> SGD -> Momentum -> NAG -> AdaGrad -> RMSProp -> Adam -> AdamW, then onwards to Lion / Sophia / Schedule-Free. Each step is framed by the specific failure of the previous …
Optimization (2): Smoothness, Strong Convexity, and Nesterov Acceleration
Three concepts that demystify most of optimization: Lipschitz smoothness fixes the maximum step size, strong convexity sets the convergence rate and uniqueness of the minimizer, and Nesterov acceleration replaces kappa …
Optimization (1): Convex Analysis Foundations
The geometric and analytic toolkit that unlocks the rest of the series: convex sets, convex functions, the conjugate (Fenchel) transform, subgradients, and the indicator/support function pair. Includes complete proofs of …
Kernel Methods (8): Deep Kernel Learning vs Deep Learning — A Practitioner's Guide
Deep kernel learning combines neural feature extractors with kernel methods. When to pick kernels over deep nets, hyperparameter tuning playbook, common failure modes, and a final 5-step kernel decision flowchart.
Kernel Methods (7): Large-Scale Kernels — Nystrom Approximation and Random Fourier Features
Exact kernel methods cost O(n^3) time in the number of training points n. Nystrom approximation and Random Fourier Features bring that down to time linear in n while keeping most of the kernel trick's expressive power.
Kernel Methods (6): Gaussian Processes — When Kernels Meet Bayesian Inference
Gaussian Processes turn kernels into a Bayesian model — posterior with uncertainty, marginal likelihood for hyperparameters, and the kernel as a prior over functions.
Kernel Methods (5): Kernel SVM, Kernel PCA, and Kernel Ridge Regression
The classic algorithms, kernelized — SVM's dual form, Kernel PCA's eigendecomposition in feature space, and Kernel Ridge's closed-form solution. With sklearn code and worked examples.
Kernel Methods (4): Common Kernel Families — RBF, Matern, Polynomial, Periodic, and More
A tour of the kernels you'll actually use: RBF (Gaussian), polynomial, linear, Matern, periodic, sigmoid. When to pick which, hyperparameter intuition, and how kernels combine.
Kernel Methods (3): RKHS — The Theoretical Soul of Kernel Methods
Reproducing Kernel Hilbert Space — the function space where kernel methods live. The reproducing property, the representer theorem, and why finite-data optimization works in infinite dimensions.
Kernel Methods (2): Mathematical Foundations — Positive-Definite Kernels and Mercer's Theorem
What makes a function a valid kernel? Positive-definiteness, the Gram matrix test, and Mercer's theorem — the spectral decomposition that justifies the kernel trick.
Kernel Methods (1): Why We Need Them — Hitting the Ceiling of Linear Algorithms
Linear algorithms can't capture non-linear patterns. The kernel trick lets you keep the linear algorithm's elegance AND model non-linear relationships — without writing the high-dimensional feature map. Part 1 of an …