Optimization on Chen Kai Blog

Low-Rank Matrix Approximation and the Pseudoinverse: From SVD to Regularization

Mon, 28 Jul 2025 09:00:00 +0000

Real data matrices are almost never both square and full rank: correlated features, too few samples, and noise-induced ill-conditioning all make “matrix inverse” either undefined or numerically useless. The pseudoinverse (Moore-Penrose inverse) preserves the spirit of an inverse while dropping the impossible-to-meet requirements: it redefines the “solution” of a linear system as the least-squares solution, breaking ties by picking the one with minimum norm. This post derives the pseudoinverse from that least-squares viewpoint, gives the four Penrose conditions, builds it from the SVD, and connects this single object to the Eckart-Young low-rank approximation theorem, PCA, recommender-system matrix factorization, and LoRA fine-tuning.
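The SVD construction mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration, not the post's own code: the function name `pinv_via_svd`, the tolerance `rtol`, and the example matrix are all assumptions chosen for the demo.

```python
import numpy as np

def pinv_via_svd(A, rtol=1e-12):
    """Moore-Penrose pseudoinverse built from the thin SVD A = U diag(s) V^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Invert only the singular values that are numerically nonzero.
    keep = s > rtol * s.max()
    s_inv = np.zeros_like(s)
    s_inv[keep] = 1.0 / s[keep]
    return (Vt.T * s_inv) @ U.T

# Hypothetical rank-deficient matrix: the third column is the sum of the first two.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [2.0, 0.0, 2.0]])
b = np.array([1.0, 2.0, 3.0, 4.0])

A_pinv = pinv_via_svd(A)
x_min_norm = A_pinv @ b   # least-squares solution with the smallest Euclidean norm
```

Because `A` has rank 2, infinitely many vectors minimize the residual; multiplying by the pseudoinverse selects the one of minimum norm.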

Essence of Linear Algebra (11): Matrix Calculus and Optimization — The Engine Behind Machine Learning

Wed, 12 Mar 2025 09:00:00 +0000

From Shower Knobs to Neural Networks

Every morning you train a tiny neural network. The water comes out too cold, so you nudge the knob — a parameter — in some direction. A second later you observe a new temperature — the error signal — and nudge again. After three or four iterations you have converged.

Optimization (12): Discrete and Global Optimization

Fri, 30 Sep 2022 09:00:00 +0000

The first eleven articles in this series tackled continuous convex problems (or convex relaxations of non-convex ones). This final article addresses two harder regimes:

Discrete optimization: variables take integer or combinatorial values. The feasible set is a finite (but exponentially large) collection of points. Linear and convex tools no longer apply directly because there are no derivatives across the integer lattice.
Global non-convex optimization: variables are continuous but the function has many local minima, and we want the global one. Methods like Newton and L-BFGS only find local minima.

Both regimes share a key feature: every known algorithm that guarantees an optimal solution takes exponential time in the worst case. Practical solutions come from (a) exact algorithms with smart pruning (branch-and-bound) and (b) heuristics that find good (but not provably optimal) solutions in polynomial time.
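To make "exact algorithms with smart pruning" concrete, here is a minimal branch-and-bound sketch for the 0/1 knapsack problem. The problem choice, the function name `knapsack_branch_and_bound`, and the fractional-relaxation bound are assumptions made for illustration; positive weights are assumed.

```python
def knapsack_branch_and_bound(values, weights, capacity):
    """Exact 0/1 knapsack by depth-first branch-and-bound (weights must be positive)."""
    # Sort items by value density so the fractional bound is easy to compute.
    order = sorted(range(len(values)), key=lambda i: values[i] / weights[i], reverse=True)
    v = [values[i] for i in order]
    w = [weights[i] for i in order]
    best = 0

    def bound(k, value, room):
        # Optimistic bound: fill the remaining room greedily, allowing one fractional item.
        for i in range(k, len(v)):
            if w[i] <= room:
                room -= w[i]
                value += v[i]
            else:
                return value + v[i] * room / w[i]
        return value

    def search(k, value, room):
        nonlocal best
        best = max(best, value)
        if k == len(v) or bound(k, value, room) <= best:
            return  # prune: this subtree cannot beat the incumbent
        if w[k] <= room:
            search(k + 1, value + v[k], room - w[k])  # branch: take item k
        search(k + 1, value, room)                    # branch: skip item k

    search(0, 0, capacity)
    return best
```

The worst case is still exponential, since every subset may be visited. The bound only helps when the relaxation is tight enough to discard subtrees early.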

Optimization (11): Non-Convex Optimization and Saddle Escape

Thu, 29 Sep 2022 09:00:00 +0000

For non-convex $f$, gradient descent has no global guarantee. The best we can say is that $\nabla f(x_t) \to 0$: we converge to a stationary point, which could be a local min, a saddle, or even a local max. This article asks: when can we say more?

Three positive results:

Saddle escape: under a “strict saddle” assumption, perturbed GD converges to local minima in polynomial time. Saddle points are unstable; Brownian noise (or just numerical perturbation) escapes them.
PL condition: the Polyak-Łojasiewicz (PL) inequality is a relaxation of strong convexity that holds in over-parameterized neural networks. Under PL, vanilla GD converges linearly, reaching $\epsilon$-accuracy in $O(\log(1/\epsilon))$ iterations, even without convexity.
Loss landscape facts: for sufficiently wide neural networks, all local minima are global, and SGD’s noise gives implicit bias toward flat minima with better generalization.

Each is rigorous in its setting. The article also discusses what is not known — there is no general theorem saying “SGD finds the global optimum of a deep network.”
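The saddle-escape claim can be illustrated on a toy function. The sketch below is an assumption-laden demo, not the perturbed-GD algorithm the article analyzes: the function, the step size, and the choice to add isotropic Gaussian noise at every step are all picked for simplicity.

```python
import numpy as np

def grad(p):
    x, y = p
    # f(x, y) = x**2 + (y**2 - 1)**2 / 4 has a strict saddle at (0, 0)
    # and two minima at (0, +1) and (0, -1).
    return np.array([2.0 * x, y * (y**2 - 1.0)])

def descend(p0, lr=0.1, steps=500, noise=0.0, seed=0):
    rng = np.random.default_rng(seed)
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        # Gradient step with optional Gaussian perturbation of the gradient.
        p = p - lr * (grad(p) + noise * rng.standard_normal(2))
    return p

plain = descend([1.0, 0.0])                   # stays on the line y = 0 and stalls at the saddle
perturbed = descend([1.0, 0.0], noise=1e-2)   # noise pushes y off zero; the iterate settles near a minimum
```

Started exactly on the saddle's stable manifold (`y = 0`), unperturbed GD never leaves it. A small perturbation is amplified by the negative curvature in `y` until the iterate falls into one of the two minima.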

Optimization (10): Stochastic Optimization and Variance Reduction

Tue, 27 Sep 2022 09:00:00 +0000

Stochastic gradient descent samples a single component gradient per step — far cheaper than full GD, but at what cost in convergence? Can we retain the cheap per-iteration cost while recovering the fast rate of deterministic methods? This article quantifies the tradeoff from a noise-budget perspective and derives the solution.

For the finite-sum problem

$$\min_x f(x) := \frac{1}{n} \sum_{i=1}^n f_i(x),$$

deterministic gradient descent costs $O(n)$ per step but converges in $O(\kappa \log(1/\epsilon))$ steps. Stochastic gradient descent (SGD) costs $O(1)$ per step but needs $O(1/\epsilon^2)$ steps for convex problems and $O(1/\epsilon)$ steps for strongly convex ones. Which is faster depends on $n$, $\kappa$, and $\epsilon$.
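The per-step cost gap and the noise that comes with it can be seen on a small least-squares instance. This is a sketch under assumed data: the random matrix `A`, the vector `b`, and the helper names are hypothetical, with $f_i(x) = \tfrac{1}{2}(a_i^\top x - b_i)^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
A = rng.standard_normal((n, d))        # hypothetical data; row i defines f_i
b = rng.standard_normal(n)
x = rng.standard_normal(d)

def full_gradient(x):
    # Gradient of f(x) = (1/n) * sum_i 0.5 * (a_i . x - b_i)**2; touches all n rows.
    return A.T @ (A @ x - b) / n

def component_gradient(x, i):
    # Gradient of the single term f_i; touches one row.
    return A[i] * (A[i] @ x - b[i])

# Averaging over a uniformly random index recovers the full gradient (unbiasedness),
# but any single draw can be far from it; that spread is the noise SGD pays for.
all_components = np.array([component_gradient(x, i) for i in range(n)])
mean_gradient = all_components.mean(axis=0)
noise_variance = np.mean(np.sum((all_components - full_gradient(x)) ** 2, axis=1))
```

Variance-reduction methods aim to shrink `noise_variance` toward zero as the iterates converge while keeping the one-row cost of `component_gradient`.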

Optimization (9): Interior-Point Methods and Self-Concordant Barriers

Mon, 26 Sep 2022 09:00:00 +0000

In 1984 Karmarkar showed that LPs could be solved in polynomial time practically, not just theoretically (the ellipsoid method had achieved this on paper). His interior-point method stayed inside the feasible polytope and converged in $O(nL)$ iterations, far better than the simplex method’s exponential worst case. Within a decade, Nesterov & Nemirovski generalized this to all convex programming via the self-concordant barrier framework. The result, $O(\sqrt{n} \log(1/\epsilon))$ Newton iterations for an $n$-dimensional problem, remains the gold standard for medium-scale convex optimization.

Optimization (8): Lagrangian Duality and KKT Conditions

Sat, 24 Sep 2022 09:00:00 +0000

The most consequential idea in constrained optimization is that constraints have prices. The Lagrangian transforms a constrained problem into an unconstrained one by attaching a non-negative multiplier to each inequality and a free multiplier to each equality. The resulting unconstrained problem may be easier (the SVM dual), or it may give a verifiable lower bound (the LP duality used to certify integer programs).

This article develops:

Weak duality: the dual is always a lower bound on the primal — no assumptions needed.
Strong duality: for convex problems satisfying Slater’s condition (or having only affine constraints), the gap is zero.
KKT conditions: stationarity + primal feasibility + dual feasibility + complementary slackness, the practical optimality system.
Saddle-point characterization: the Lagrangian’s saddle point coincides with the optimal primal–dual pair.

Each result is proved or carefully cited. We close with the SVM example, where the dual changes the problem dimension from $d$ (number of features) to $n$ (number of training points): the original kernel-method magic.
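A one-dimensional worked instance shows the four results at once. The problem below is a hypothetical illustration, not the article's SVM example: minimize $x^2$ subject to $x \ge 1$, written as $1 - x \le 0$.

$$
\begin{aligned}
\mathcal{L}(x, \lambda) &= x^2 + \lambda (1 - x), \qquad \lambda \ge 0, \\
g(\lambda) &= \min_x \mathcal{L}(x, \lambda) = \lambda - \tfrac{1}{4}\lambda^2 \quad \text{(attained at } x = \lambda/2\text{)}, \\
\max_{\lambda \ge 0} g(\lambda) &= g(2) = 1 = (x^\star)^2 \quad \text{with } x^\star = 1, \\
\text{KKT at } (x^\star, \lambda^\star) = (1, 2): \quad & 2x^\star - \lambda^\star = 0, \quad 1 - x^\star \le 0, \quad \lambda^\star \ge 0, \quad \lambda^\star (1 - x^\star) = 0.
\end{aligned}
$$

Every $g(\lambda)$ is a lower bound on the primal optimum $1$ (weak duality), the bound is tight at $\lambda^\star = 2$ (strong duality, since the problem is convex and $x = 2$ is strictly feasible), and the multiplier is positive because the constraint is active.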

Optimization (7): Second-Order Methods

Thu, 22 Sep 2022 09:00:00 +0000

First-order methods cannot do better than $O(\sqrt{\kappa} \log(1/\epsilon))$ iterations to reach $\epsilon$-accuracy (article 05). Second-order methods break this barrier by using curvature: Newton’s method has quadratic local convergence, so the number of correct digits roughly doubles every iteration, and quasi-Newton methods retain most of this speed without computing the Hessian explicitly.

The cost is in the per-iteration work: Newton solves an $n \times n$ linear system per step ($O(n^3)$), BFGS maintains an $n \times n$ matrix ($O(n^2)$ per step + $O(n^2)$ memory), and L-BFGS uses only $O(mn)$ memory for a chosen history $m$ (typically 5–20).
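The "digits double every iteration" behaviour is easy to observe in one dimension. The sketch below assumes a hypothetical test function $f(x) = e^x - 2x$, whose minimizer is $\ln 2$, and a starting point close enough for the local theory to apply.

```python
import math

def newton_minimize(x, steps=4):
    """Newton's method on f(x) = exp(x) - 2x; returns the error after each step."""
    x_star = math.log(2.0)                 # exact minimizer, where f'(x) = exp(x) - 2 = 0
    errors = [abs(x - x_star)]
    for _ in range(steps):
        first = math.exp(x) - 2.0          # f'(x)
        second = math.exp(x)               # f''(x)
        x -= first / second                # Newton step
        errors.append(abs(x - x_star))
    return errors

errors = newton_minimize(1.0)
for k, err in enumerate(errors):
    print(f"step {k}: error = {err:.3e}")
```

Each error is roughly half the square of the previous one, which is the quadratic rate. In $n$ dimensions the division by `second` becomes the linear solve that costs $O(n^3)$.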

Optimization (6): Composite Optimization and Proximal Methods

Wed, 21 Sep 2022 09:00:00 +0000

When your objective contains a non-smooth piece (sparse regularisation, total variation, an indicator of a constraint set) or a constraint that is hard to handle directly, “just do gradient descent” stalls — there is no gradient at the kink, or every step violates feasibility. The proximal operator is the engineered, beautiful workaround: think of each update as “take a step on the smooth part, then run a tiny penalised minimisation that pulls the iterate back toward a structured solution”.
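For the sparse-regularisation case, the "gradient step, then small penalised minimisation" recipe has a closed form. The following is a minimal sketch of proximal gradient (ISTA) for the lasso objective; the function names and the fixed step size $1/L$ are assumptions for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, steps=500):
    """Proximal gradient for 0.5 * ||A x - b||^2 + lam * ||x||_1."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / L, with L the largest eigenvalue of A^T A
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        # Gradient step on the smooth part, then the prox of the non-smooth part.
        x = soft_threshold(x - step * A.T @ (A @ x - b), step * lam)
    return x
```

The prox step is what produces exact zeros: any coordinate whose magnitude falls below the threshold is set to zero rather than merely made small.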

Optimization (5): Acceleration Beyond Nesterov

Tue, 20 Sep 2022 09:00:00 +0000

Article 02 introduced Nesterov acceleration and showed that it improves the dependence of the iteration count on the condition number from $\kappa$ to $\sqrt{\kappa}$. This article asks the deeper questions:

Why $\sqrt{\kappa}$ and not faster? We prove a matching lower bound — no first-order method can do better.
Is Nesterov the only way? Polyak’s Heavy-Ball method achieves the same rate using a completely different update rule.
Can we accelerate any solver? The Catalyst framework wraps a black-box optimizer to gain the accelerated rate, at the cost of solving a regularized subproblem.

The unifying tool is a Lyapunov potential — a non-negative quantity that the algorithm decreases at every step. Both Nesterov and Heavy-Ball admit Lyapunov proofs, and the lower bound essentially says no Lyapunov decrease can happen faster.
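A quick numerical comparison makes the $\kappa$ versus $\sqrt{\kappa}$ gap visible. This sketch assumes a two-dimensional quadratic with known curvature bounds and uses the classical optimal step sizes for that setting; Heavy-Ball's accelerated rate with these parameters is only guaranteed on quadratics.

```python
import numpy as np

mu, L = 1.0, 100.0                     # assumed curvature bounds; kappa = L / mu = 100
H = np.diag([mu, L])                   # f(x) = 0.5 * x^T H x, minimised at the origin
x0 = np.array([1.0, 1.0])

def gradient_descent(steps):
    x = x0.copy()
    for _ in range(steps):
        x = x - 2.0 / (L + mu) * (H @ x)
    return np.linalg.norm(x)           # distance to the minimizer

def heavy_ball(steps):
    alpha = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
    beta = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(steps):
        # Gradient step plus a momentum term built from the previous displacement.
        x, x_prev = x - alpha * (H @ x) + beta * (x - x_prev), x
    return np.linalg.norm(x)

gd_err, hb_err = gradient_descent(200), heavy_ball(200)
```

After the same 200 iterations, `hb_err` is many orders of magnitude smaller than `gd_err`, because the contraction factor per step is about $(\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$ instead of $(\kappa-1)/(\kappa+1)$.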

Optimization (4): Learning Rate and Schedules

Sun, 18 Sep 2022 09:00:00 +0000

Your model diverges. You halve the learning rate. Now it trains, but takes forever. You halve again — now the loss is a flat line. Sound familiar? Of all the knobs you can turn, learning rate is the one that most often decides whether training converges, crawls, or blows up. This guide gives you the intuition, the minimal math, and a practical workflow to get it right — from a 12-layer CNN on your laptop to a 70B-parameter LLM on a thousand GPUs.
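The three regimes (blows up, converges, crawls) already appear on a one-dimensional quadratic. The sketch below is a toy under assumed values: the curvature of 10 and the three learning rates are chosen only to land on either side of the stability threshold $2/\text{curvature}$.

```python
def final_loss(lr, curvature=10.0, steps=50, x0=1.0):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2 and return the last loss."""
    x = x0
    for _ in range(steps):
        x -= lr * curvature * x       # gradient of f is curvature * x
    return 0.5 * curvature * x * x

# 0.25 is above the threshold 2 / curvature = 0.2, 0.1 equals 1 / curvature, 0.001 is far below.
for lr in (0.25, 0.1, 0.001):
    print(f"lr={lr}: loss after 50 steps = {final_loss(lr):.3g}")
```

Each step multiplies `x` by `1 - lr * curvature`, so the iterate diverges when that factor exceeds 1 in magnitude and barely moves when it is close to 1.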

Optimization (3): The Gradient Descent Family from SGD to AdamW

Fri, 16 Sep 2022 09:00:00 +0000

Why is “tuning the LR is an art” a meme for ResNet, while every modern LLM paper just writes “AdamW, $\beta_1{=}0.9, \beta_2{=}0.95, \mathrm{wd}{=}0.1$” and moves on? It is not an accident; it is the end-point of three decades of optimizer evolution.

This post walks the lineage end-to-end on a single thread: each step exists because of a specific failure of the previous one. We end with the three directions that have actually entered the post-2023 large-model toolkit: Lion, Sophia, and Schedule-Free.
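For reference, a single AdamW update can be written out in NumPy. This is a sketch of the standard update with decoupled weight decay, not code from the post: the function name `adamw_step` and the learning rate and `eps` defaults are assumptions, while the $\beta$ and weight-decay defaults mirror the values quoted above.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.95, eps=1e-8, wd=0.1):
    """One AdamW update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g          # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: applied to the weights directly, not folded into g.
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v
```

A usage example: starting from `m = v = 0`, the first call `adamw_step(w, g, 0.0, 0.0, t=1)` moves each weight by roughly `lr` in the direction opposite to its gradient sign, plus the decay term.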

Optimization (2): Smoothness, Strong Convexity, and Nesterov Acceleration

Thu, 15 Sep 2022 09:00:00 +0000

A surprising amount of “optimizer folklore” collapses into three concepts:

How fast can the gradient change? Lipschitz smoothness ($L$-smoothness) caps the step size.
How sharp is the bottom? $\mu$-strong convexity sets the convergence rate and forces the minimizer to be unique.
Can we get there faster without losing stability? Nesterov acceleration and adaptive restart cut the dependence of the iteration count on the condition number from $\kappa$ to $\sqrt{\kappa}$.

This post lays them out on a single thread: nail the geometric intuition with the minimum number of inequalities, prove the key theorems, then close with a least-squares experiment that pits GD, Heavy Ball, and Nesterov against each other. The goal is not to stack formulas — it is to make you able to look at a new problem and instantly answer “what step size, what rate, is acceleration worth it?”
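For least squares, all three quantities can be read off the Hessian. The sketch below assumes a small random design matrix (hypothetical data) and runs plain gradient descent with the step size $1/L$ that smoothness allows.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))       # hypothetical design matrix
b = rng.standard_normal(50)

# For f(x) = 0.5 * ||A x - b||^2 the Hessian is A^T A, so its extreme
# eigenvalues are the strong-convexity and smoothness constants.
eigs = np.linalg.eigvalsh(A.T @ A)     # returned in ascending order
mu, L = eigs[0], eigs[-1]
kappa = L / mu                         # condition number

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
x = np.zeros(5)
for _ in range(200):
    x = x - (1.0 / L) * (A.T @ (A @ x - b))   # gradient step with step size 1/L
```

With this step size the distance to `x_star` shrinks by a factor of at most `1 - 1 / kappa` per iteration, which answers "what step size, what rate" for this problem.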

Optimization (1): Convex Analysis Foundations

Wed, 14 Sep 2022 09:00:00 +0000

This article is the foundation the rest of the series is built on. Almost every result we will prove later — convergence rates of gradient descent, Lagrangian duality, the proximal operator, even the analysis of stochastic methods — relies on a small set of facts about convex sets and convex functions. We will derive all of them from scratch.

If you only remember three things from this article, make it these:

Python Engineering (8): Performance — Profiling, Caching, and Knowing When to Stop

Wed, 27 Apr 2022 09:00:00 +0000

Donald Knuth’s famous quote is often half-remembered. The full version is: “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.” The second sentence is the key. Performance work isn’t about making everything fast; it’s about finding the 3% that matters and making that fast.

This article is about finding that 3%. You’ll learn to profile first, optimize second, and measure the impact of each change.
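A small standard-library example of the profile-then-cache loop might look like the following. It is a sketch, not the article's code: the Fibonacci functions and the `profile` helper are hypothetical, and the call count reported by `pstats` stands in for a full timing report.

```python
import cProfile
import io
import pstats
from functools import lru_cache

def slow_fib(n):
    return n if n < 2 else slow_fib(n - 1) + slow_fib(n - 2)

@lru_cache(maxsize=None)
def fast_fib(n):
    return n if n < 2 else fast_fib(n - 1) + fast_fib(n - 2)

def profile(func, *args):
    """Run func under cProfile and return (result, total number of function calls)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stats = pstats.Stats(profiler, stream=io.StringIO())
    return result, stats.total_calls

slow_result, slow_calls = profile(slow_fib, 20)
fast_result, fast_calls = profile(fast_fib, 20)
print(f"uncached: {slow_calls} calls, cached: {fast_calls} calls")
```

Measuring before and after keeps the change honest: the cache is justified here only because the profile shows the same arguments being recomputed thousands of times.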

Kernel Methods (1): Why We Need Them — Hitting the Ceiling of Linear Algorithms

Wed, 24 Nov 2021 09:00:00 +0000

The first time I tried to fit a logistic regression to a dataset of two interlocking spirals, I burned an afternoon tweaking the regularizer, swapping solvers, and rescaling features — convinced that I was doing something wrong. The accuracy hovered around 50%. That is the noise floor of a coin flip; my model was, in a very literal sense, learning nothing.

The model was not buggy. The data was simply not the kind of object a straight line can describe. No amount of C, class_weight, or tol was going to change that. Once you have seen this failure mode, you start noticing it everywhere: in customer-churn data with non-monotone relationships, in image classification before deep learning, in any regression where the trend bends. A linear algorithm has a hard ceiling, and you only break through that ceiling by changing the kind of object the algorithm operates on.
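The same ceiling shows up on a four-point XOR dataset, used here as a smaller stand-in for the spirals. The sketch assumes a least-squares linear classifier and a hand-picked feature map (the product of the two coordinates); both are illustrative choices, not the method of the article.

```python
import numpy as np

# XOR-style data: the label is +1 when the two coordinates share a sign.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def fit_and_score(features, labels):
    """Least-squares linear classifier with a bias term; returns training accuracy."""
    design = np.column_stack([features, np.ones(len(features))])
    w, *_ = np.linalg.lstsq(design, labels, rcond=None)
    return np.mean(np.sign(design @ w) == labels)

linear_acc = fit_and_score(X, y)

# Hypothetical explicit feature map: append the product x1 * x2.
lifted = np.column_stack([X, X[:, 0] * X[:, 1]])
lifted_acc = fit_and_score(lifted, y)
```

No line can classify more than three of the four XOR points correctly, so `linear_acc` cannot exceed 0.75 whatever the solver does. After the lift, the new coordinate equals the label and the same linear fit separates the data perfectly.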