Tag: Optimization

Jul 28, 2025 Standalone 20 min read

Low-Rank Matrix Approximation and the Pseudoinverse: From SVD to Regularization

From the least-squares view to the Moore-Penrose pseudoinverse, the four Penrose conditions, computation via SVD, truncated SVD, Tikhonov regularization, and modern applications from PCA to LoRA.

Mar 12, 2025 Linear Algebra 30 min read

Essence of Linear Algebra (11): Matrix Calculus and Optimization — The Engine Behind Machine Learning

Adjusting the shower temperature is a tiny version of training a neural network: you change a parameter based on an error signal. Matrix calculus is the language that scales this idea to millions of parameters, and …

Sep 30, 2022 Optimization Theory 38 min read

Optimization (12): Discrete and Global Optimization

When variables are integer-valued or the problem is non-convex with multiple basins, classical convex methods stop working. This article surveys what does work: integer programming via branch-and-bound, LP relaxation gap …

Sep 29, 2022 Optimization Theory 22 min read

Optimization (11): Non-Convex Optimization and Saddle Escape

Why does SGD work for training neural networks despite the non-convex landscape? We prove perturbed GD escapes strict saddles in polynomial time, derive convergence under the Polyak-Łojasiewicz condition, and survey what …

Sep 27, 2022 Optimization Theory 20 min read

Optimization (10): Stochastic Optimization and Variance Reduction

Why does SGD work? We prove the O(1/sqrt(T)) convex rate and the O(1/(mu T)) strongly convex rate from the gradient noise budget. Then variance reduction: SVRG, SAGA, Katyusha — methods that get to the linear rate of …

Sep 26, 2022 Optimization Theory 22 min read

Optimization (9): Interior-Point Methods and Self-Concordant Barriers

How interior-point methods became the default solver for convex programming: replace inequalities with a logarithmic barrier, parametrize the central path, and apply Newton's method. Includes the self-concordance …

Sep 24, 2022 Optimization Theory 20 min read

Optimization (8): Lagrangian Duality and KKT Conditions

How constraints become prices: the Lagrangian, weak duality, Slater's condition for strong duality, the KKT system as necessary and sufficient optimality conditions, and why the SVM dual is much smaller than the SVM primal. …

Sep 22, 2022 Optimization Theory 24 min read

Optimization (7): Second-Order Methods

Second-order methods break the sqrt(kappa) barrier by using curvature. We prove Newton's quadratic local convergence, derive BFGS from a secant condition plus a low-rank update, walk through L-BFGS's two-loop recursion that …

Sep 21, 2022 Optimization Theory 32 min read

Optimization (6): Composite Optimization and Proximal Methods

A systematic walk through the proximal operator: convex-analysis basics, the Moreau envelope, closed-form proxes, and how they power ISTA, FISTA, ADMM, LASSO, and SVM in practice.

Sep 20, 2022 Optimization Theory 26 min read

Optimization (5): Acceleration Beyond Nesterov

What does it really mean for a first-order method to be optimal? We prove a tight lower bound matching Nesterov's rate, derive Polyak's Heavy-Ball method as the continuous-time limit, build a unified Lyapunov framework …

Sep 18, 2022 Optimization Theory 40 min read

Optimization (4): Learning Rate and Schedules

A practitioner's guide to the single most important hyperparameter: why too-large LR explodes, how warmup and schedules really work, the LR range test, the LR-batch-size-weight-decay coupling, and recent ideas like WSD, …

Sep 16, 2022 Optimization Theory 24 min read

Optimization (3): The Gradient Descent Family from SGD to AdamW

One article that traces the full lineage GD -> SGD -> Momentum -> NAG -> AdaGrad -> RMSProp -> Adam -> AdamW, then onwards to Lion / Sophia / Schedule-Free. Each step is framed by the specific failure of the previous …

Sep 15, 2022 Optimization Theory 28 min read

Optimization (2): Smoothness, Strong Convexity, and Nesterov Acceleration

Three concepts that demystify most of optimization: Lipschitz smoothness fixes the maximum step size, strong convexity sets the convergence rate and uniqueness of the minimizer, and Nesterov acceleration replaces kappa …

Sep 14, 2022 Optimization Theory 26 min read

Optimization (1): Convex Analysis Foundations

The geometric and analytic toolkit that unlocks the rest of the series: convex sets, convex functions, the conjugate (Fenchel) transform, subgradients, and the indicator/support function pair. Includes complete proofs of …

Apr 27, 2022 Python Engineering 36 min read

Python Engineering (8): Performance — Profiling, Caching, and Knowing When to Stop

Profile Python code to find real bottlenecks, apply caching and vectorization where they matter, and avoid the trap of premature optimization.

Nov 24, 2021 Kernel Methods 66 min read

Kernel Methods (1): Why We Need Them — Hitting the Ceiling of Linear Algorithms

Linear algorithms can't capture non-linear patterns. The kernel trick lets you keep the linear algorithm's elegance and still model non-linear relationships, without ever writing out the high-dimensional feature map. Part 1 of an …