<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Optimization on Chen Kai Blog</title><link>https://www.chenk.top/en/tags/optimization/</link><description>Recent content in Optimization on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Mon, 28 Jul 2025 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/tags/optimization/index.xml" rel="self" type="application/rss+xml"/><item><title>Low-Rank Matrix Approximation and the Pseudoinverse: From SVD to Regularization</title><link>https://www.chenk.top/en/standalone/low-rank-approximation-pseudoinverse/</link><pubDate>Mon, 28 Jul 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/low-rank-approximation-pseudoinverse/</guid><description>&lt;p>Real data matrices are almost never both square and full rank: correlated features, too few samples, and noise-induced ill-conditioning all make &amp;ldquo;matrix inverse&amp;rdquo; either undefined or numerically useless. The &lt;strong>pseudoinverse&lt;/strong> (Moore-Penrose inverse) preserves the &lt;em>spirit&lt;/em> of an inverse while dropping the impossible-to-meet requirements: it redefines the &amp;ldquo;solution&amp;rdquo; of a linear system as the &lt;strong>least-squares solution&lt;/strong>, breaking ties by picking the one with &lt;strong>minimum norm&lt;/strong>. 
This post derives the pseudoinverse from that least-squares viewpoint, gives the four Penrose conditions, builds it from the SVD, and connects this single object to &lt;strong>the Eckart-Young low-rank approximation theorem&lt;/strong>, &lt;strong>PCA&lt;/strong>, &lt;strong>recommender-system matrix factorization&lt;/strong>, and &lt;strong>LoRA fine-tuning&lt;/strong>.&lt;/p></description></item><item><title>Essence of Linear Algebra (11): Matrix Calculus and Optimization — The Engine Behind Machine Learning</title><link>https://www.chenk.top/en/linear-algebra/11-matrix-calculus-and-optimization/</link><pubDate>Wed, 12 Mar 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/linear-algebra/11-matrix-calculus-and-optimization/</guid><description>&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/linear-algebra/11-matrix-calculus-and-optimization/illustration_1.png" alt="Essence of Linear Algebra (11): Matrix Calculus and Optimization — The Engine Behind Machine Learning — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;h2 id="from-shower-knobs-to-neural-networks" class="heading-anchor">From Shower Knobs to Neural Networks
&lt;/h2>&lt;p>Every morning you train a tiny neural network. The water comes out too cold, so you nudge the knob — a &lt;em>parameter&lt;/em> — in some direction. A second later you observe a new temperature — the &lt;em>error signal&lt;/em> — and nudge again. After three or four iterations you have converged.&lt;/p></description></item><item><title>Optimization (12): Discrete and Global Optimization</title><link>https://www.chenk.top/en/optimization-theory/12-discrete-global-optimization/</link><pubDate>Fri, 30 Sep 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/optimization-theory/12-discrete-global-optimization/</guid><description>&lt;p>The first eleven articles in this series tackled &lt;strong>continuous convex&lt;/strong> problems (or convex relaxations of non-convex ones). This final article addresses two harder regimes:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Discrete optimization&lt;/strong>: variables take integer or combinatorial values. The feasible set is a finite (but exponentially large) collection of points. Linear and convex tools no longer apply directly because there are no derivatives across the integer lattice.&lt;/li>
&lt;li>&lt;strong>Global non-convex optimization&lt;/strong>: variables are continuous but the function has many local minima, and we want the &lt;em>global&lt;/em> one. Methods like Newton and L-BFGS only find local minima.&lt;/li>
&lt;/ul>
&lt;p>Both regimes share a key feature: &lt;strong>all known exact algorithms are exponential&lt;/strong> in the worst case. Practical solutions come from (a) exact algorithms with smart pruning (branch-and-bound) and (b) heuristics that find good (but not necessarily optimal) solutions in polynomial time.&lt;/p></description></item><item><title>Optimization (11): Non-Convex Optimization and Saddle Escape</title><link>https://www.chenk.top/en/optimization-theory/11-nonconvex-saddle-escape/</link><pubDate>Thu, 29 Sep 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/optimization-theory/11-nonconvex-saddle-escape/</guid><description>&lt;p>For non-convex &lt;span class="math-inline">$f$&lt;/span>
, gradient descent has no global guarantee. The best we can say is that &lt;span class="math-inline">$\nabla f(x_t) \to 0$&lt;/span>
 — we converge to a stationary point, which could be a local min, a saddle, or even a local max. This article asks: when can we say more?&lt;/p>
&lt;p>Three positive results:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Saddle escape&lt;/strong>: under a &amp;ldquo;strict saddle&amp;rdquo; assumption, perturbed GD converges to local minima in polynomial time. Saddle points are unstable; Brownian noise (or just numerical perturbation) escapes them.&lt;/li>
&lt;li>&lt;strong>PL condition&lt;/strong>: a relaxation of strong convexity that can hold in over-parameterized neural networks. Under PL, vanilla GD converges linearly, needing &lt;span class="math-inline">$O(\log(1/\epsilon))$&lt;/span>
 iterations even without convexity.&lt;/li>
&lt;li>&lt;strong>Loss landscape facts&lt;/strong>: for sufficiently wide neural networks, all local minima are global, and SGD&amp;rsquo;s noise gives implicit bias toward flat minima with better generalization.&lt;/li>
&lt;/ol>
&lt;p>Each is rigorous in its setting. The article also discusses what is &lt;strong>not&lt;/strong> known — there is no general theorem saying &amp;ldquo;SGD finds the global optimum of a deep network.&amp;rdquo;&lt;/p></description></item><item><title>Optimization (10): Stochastic Optimization and Variance Reduction</title><link>https://www.chenk.top/en/optimization-theory/10-stochastic-variance-reduction/</link><pubDate>Tue, 27 Sep 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/optimization-theory/10-stochastic-variance-reduction/</guid><description>&lt;p>Stochastic gradient descent samples a single component gradient per step — far cheaper than full GD, but at what cost in convergence? Can we retain the cheap per-iteration cost while recovering the fast rate of deterministic methods? This article quantifies the tradeoff from a noise-budget perspective and derives the solution.&lt;/p>
&lt;span class="math-block">$$
\min_x f(x) := \frac{1}{n} \sum_{i=1}^n f_i(x),
$$&lt;/span>
&lt;p>
For this finite-sum problem, deterministic gradient descent costs &lt;span class="math-inline">$O(n)$&lt;/span>
 per step but converges in &lt;span class="math-inline">$O(\kappa \log(1/\epsilon))$&lt;/span>
 steps. &lt;strong>Stochastic gradient descent&lt;/strong> (SGD) costs &lt;span class="math-inline">$O(1)$&lt;/span>
 per step but converges in &lt;span class="math-inline">$O(1/\epsilon^2)$&lt;/span>
 steps for convex problems and &lt;span class="math-inline">$O(1/\epsilon)$&lt;/span>
 for strongly convex ones. Which is faster depends on &lt;span class="math-inline">$n$&lt;/span>
, &lt;span class="math-inline">$\kappa$&lt;/span>
, and &lt;span class="math-inline">$\epsilon$&lt;/span>
.&lt;/p></description></item><item><title>Optimization (9): Interior-Point Methods and Self-Concordant Barriers</title><link>https://www.chenk.top/en/optimization-theory/09-interior-point-barrier/</link><pubDate>Mon, 26 Sep 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/optimization-theory/09-interior-point-barrier/</guid><description>&lt;p>In 1984 Karmarkar showed that LPs could be solved in polynomial time &lt;em>practically&lt;/em> — not just theoretically (the ellipsoid method had achieved this on paper). His &lt;strong>interior-point method&lt;/strong> stayed inside the feasible polytope and converged in &lt;span class="math-inline">$O(n L)$&lt;/span>
 iterations, far better than the simplex method&amp;rsquo;s exponential worst case. Within a decade, Nesterov &amp;amp; Nemirovski generalized this to &lt;strong>all convex programming&lt;/strong> via the &lt;strong>self-concordant barrier&lt;/strong> framework. The result — &lt;span class="math-inline">$O(\sqrt{n} \log(1/\epsilon))$&lt;/span>
 Newton iterations for an &lt;span class="math-inline">$n$&lt;/span>
-dimensional problem — remains the gold standard for medium-scale convex optimization.&lt;/p></description></item><item><title>Optimization (8): Lagrangian Duality and KKT Conditions</title><link>https://www.chenk.top/en/optimization-theory/08-lagrangian-duality-kkt/</link><pubDate>Sat, 24 Sep 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/optimization-theory/08-lagrangian-duality-kkt/</guid><description>&lt;p>The most consequential idea in constrained optimization is that &lt;strong>constraints have prices&lt;/strong>. The Lagrangian transforms a constrained problem into an unconstrained one by attaching a non-negative multiplier to each inequality and a free multiplier to each equality. The resulting unconstrained problem may be easier (the SVM dual), or it may give a verifiable lower bound (the LP duality used to certify integer programs).&lt;/p>
&lt;p>This article develops:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Weak duality:&lt;/strong> the dual is always a lower bound on the primal — no assumptions needed.&lt;/li>
&lt;li>&lt;strong>Strong duality:&lt;/strong> for convex problems satisfying Slater&amp;rsquo;s condition (or with only affine constraints), the gap is zero.&lt;/li>
&lt;li>&lt;strong>KKT conditions:&lt;/strong> stationarity + primal feasibility + dual feasibility + complementary slackness, the practical optimality system.&lt;/li>
&lt;li>&lt;strong>Saddle-point characterization:&lt;/strong> the Lagrangian&amp;rsquo;s saddle point coincides with the optimal primal&amp;ndash;dual pair.&lt;/li>
&lt;/ul>
&lt;p>Each result is proved or carefully cited. We close with the SVM example, where the dual changes the problem dimension from &lt;span class="math-inline">$d$&lt;/span>
 (number of features) to &lt;span class="math-inline">$n$&lt;/span>
 (number of training points), the original kernel-method magic.&lt;/p></description></item><item><title>Optimization (7): Second-Order Methods</title><link>https://www.chenk.top/en/optimization-theory/07-second-order-methods/</link><pubDate>Thu, 22 Sep 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/optimization-theory/07-second-order-methods/</guid><description>&lt;p>First-order methods top out at &lt;span class="math-inline">$O(\sqrt{\kappa} \log(1/\epsilon))$&lt;/span>
 iterations to reach &lt;span class="math-inline">$\epsilon$&lt;/span>
-accuracy (article 05). Second-order methods break this barrier by using curvature: Newton&amp;rsquo;s method has &lt;strong>quadratic&lt;/strong> local convergence — the number of correct digits doubles every iteration — and quasi-Newton methods retain most of this speed without computing the Hessian explicitly.&lt;/p>
&lt;p>The cost is in the per-iteration work: Newton solves an &lt;span class="math-inline">$n \times n$&lt;/span>
 linear system per step (&lt;span class="math-inline">$O(n^3)$&lt;/span>
), BFGS maintains an &lt;span class="math-inline">$n \times n$&lt;/span>
 matrix (&lt;span class="math-inline">$O(n^2)$&lt;/span>
 per step + &lt;span class="math-inline">$O(n^2)$&lt;/span>
 memory), and L-BFGS uses only &lt;span class="math-inline">$O(mn)$&lt;/span>
 memory for a chosen history &lt;span class="math-inline">$m$&lt;/span>
 (typically 5&amp;ndash;20).&lt;/p></description></item><item><title>Optimization (6): Composite Optimization and Proximal Methods</title><link>https://www.chenk.top/en/optimization-theory/06-composite-proximal-methods/</link><pubDate>Wed, 21 Sep 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/optimization-theory/06-composite-proximal-methods/</guid><description>&lt;p>When your objective contains a non-smooth piece (sparse regularisation, total variation, an indicator of a constraint set) or a constraint that is hard to handle directly, &amp;ldquo;just do gradient descent&amp;rdquo; stalls: there is no gradient at the kink, or every step violates feasibility. The &lt;strong>proximal operator&lt;/strong> is the principled workaround: think of each update as &amp;ldquo;take a step on the smooth part, then run a tiny penalised minimisation that pulls the iterate back toward a structured solution&amp;rdquo;.&lt;/p></description></item><item><title>Optimization (5): Acceleration Beyond Nesterov</title><link>https://www.chenk.top/en/optimization-theory/05-acceleration-beyond-nesterov/</link><pubDate>Tue, 20 Sep 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/optimization-theory/05-acceleration-beyond-nesterov/</guid><description>&lt;p>Article 02 introduced Nesterov acceleration and showed it improves the iteration count&amp;rsquo;s dependence on the condition number from &lt;span class="math-inline">$\kappa$&lt;/span>
 to &lt;span class="math-inline">$\sqrt{\kappa}$&lt;/span>
. This article asks the deeper questions:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Why &lt;span class="math-inline">$\sqrt{\kappa}$&lt;/span>
 and not faster?&lt;/strong> We prove a matching lower bound — no first-order method can do better.&lt;/li>
&lt;li>&lt;strong>Is Nesterov the only way?&lt;/strong> Polyak&amp;rsquo;s Heavy-Ball method achieves the same rate on strongly convex quadratics using a different update rule.&lt;/li>
&lt;li>&lt;strong>Can we accelerate any solver?&lt;/strong> The Catalyst framework wraps a black-box optimizer to gain the accelerated rate, at the cost of solving a regularized subproblem.&lt;/li>
&lt;/ul>
&lt;p>The unifying tool is a &lt;strong>Lyapunov potential&lt;/strong> — a non-negative quantity that the algorithm decreases at every step. Both Nesterov and Heavy-Ball admit Lyapunov proofs, and the lower bound essentially says no Lyapunov decrease can happen faster.&lt;/p></description></item><item><title>Optimization (4): Learning Rate and Schedules</title><link>https://www.chenk.top/en/optimization-theory/04-learning-rate-schedules/</link><pubDate>Sun, 18 Sep 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/optimization-theory/04-learning-rate-schedules/</guid><description>&lt;p>Your model diverges. You halve the learning rate. Now it trains, but takes forever. You halve again — now the loss is a flat line. Sound familiar? Of all the knobs you can turn, &lt;strong>learning rate&lt;/strong> is the one that most often decides whether training converges, crawls, or blows up. This guide gives you the intuition, the minimal math, and a practical workflow to get it right — from a 12-layer CNN on your laptop to a 70B-parameter LLM on a thousand GPUs.&lt;/p></description></item><item><title>Optimization (3): The Gradient Descent Family from SGD to AdamW</title><link>https://www.chenk.top/en/optimization-theory/03-gradient-descent-family/</link><pubDate>Fri, 16 Sep 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/optimization-theory/03-gradient-descent-family/</guid><description>&lt;p>Why is &amp;ldquo;tuning the LR is an art&amp;rdquo; a meme for ResNet, while every modern LLM paper just writes &amp;ldquo;AdamW, &lt;span class="math-inline">$\beta_1{=}0.9, \beta_2{=}0.95, \mathrm{wd}{=}0.1$&lt;/span>
&amp;rdquo; and moves on? It is not an accident — it is the &lt;strong>end-point of three decades of optimizer evolution&lt;/strong>.&lt;/p>
&lt;p>This post walks the lineage end-to-end on a single thread: each step exists because of a &lt;strong>specific failure&lt;/strong> of the previous one. We end with the three directions that have actually entered the post-2023 large-model toolkit: Lion, Sophia, and Schedule-Free.&lt;/p></description></item><item><title>Optimization (2): Smoothness, Strong Convexity, and Nesterov Acceleration</title><link>https://www.chenk.top/en/optimization-theory/02-smoothness-strong-convexity-nesterov/</link><pubDate>Thu, 15 Sep 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/optimization-theory/02-smoothness-strong-convexity-nesterov/</guid><description>&lt;p>A surprising amount of &amp;ldquo;optimizer folklore&amp;rdquo; collapses into three concepts:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>How fast can the gradient change?&lt;/strong> Lipschitz smoothness (&lt;span class="math-inline">$L$&lt;/span>
-smoothness) caps the step size.&lt;/li>
&lt;li>&lt;strong>How sharp is the bottom?&lt;/strong> &lt;span class="math-inline">$\mu$&lt;/span>
-strong convexity sets the convergence rate and forces the minimizer to be unique.&lt;/li>
&lt;li>&lt;strong>Can we get there faster without losing stability?&lt;/strong> Nesterov acceleration and adaptive restart turn the iteration count&amp;rsquo;s dependence on the condition number from &lt;span class="math-inline">$\kappa$&lt;/span>
 into &lt;span class="math-inline">$\sqrt{\kappa}$&lt;/span>
.&lt;/li>
&lt;/ul>
&lt;p>This post lays them out on a single thread: nail the geometric intuition with the minimum number of inequalities, prove the key theorems, then close with a least-squares experiment that pits GD, Heavy Ball, and Nesterov against each other. The goal is not to stack formulas — it is to make you able to look at a new problem and instantly answer &amp;ldquo;what step size, what rate, is acceleration worth it?&amp;rdquo;&lt;/p></description></item><item><title>Optimization (1): Convex Analysis Foundations</title><link>https://www.chenk.top/en/optimization-theory/01-convex-analysis-foundations/</link><pubDate>Wed, 14 Sep 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/optimization-theory/01-convex-analysis-foundations/</guid><description>&lt;p>This article is the foundation the rest of the series is built on. Almost every result we will prove later — convergence rates of gradient descent, Lagrangian duality, the proximal operator, even the analysis of stochastic methods — relies on a small set of facts about convex sets and convex functions. We will derive all of them from scratch.&lt;/p>
&lt;p>If you only remember three things from this article, make it these:&lt;/p></description></item><item><title>Python Engineering (8): Performance — Profiling, Caching, and Knowing When to Stop</title><link>https://www.chenk.top/en/python-engineering/08-performance-and-profiling/</link><pubDate>Wed, 27 Apr 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/python-engineering/08-performance-and-profiling/</guid><description>&lt;p>Donald Knuth&amp;rsquo;s famous quote is often half-remembered. The full version is: &amp;ldquo;We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.&amp;rdquo; The second sentence is the key. Performance work isn&amp;rsquo;t about making everything fast; it&amp;rsquo;s about finding the 3% that matters and making that fast.&lt;/p>
&lt;p>This article is about finding that 3%. You&amp;rsquo;ll learn to profile first, optimize second, and measure the impact of each change.&lt;/p></description></item><item><title>Kernel Methods (1): Why We Need Them — Hitting the Ceiling of Linear Algorithms</title><link>https://www.chenk.top/en/kernel-methods/01-why-kernels/</link><pubDate>Wed, 24 Nov 2021 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/kernel-methods/01-why-kernels/</guid><description>&lt;p>The first time I tried to fit a logistic regression to a dataset of two interlocking spirals, I burned an afternoon tweaking the regularizer, swapping solvers, and rescaling features — convinced that I was doing something wrong. The accuracy hovered around 50%. That is the noise floor of a coin flip; my model was, in a very literal sense, learning nothing.&lt;/p>
&lt;p>The model was not buggy. The data was simply not the kind of object a straight line can describe. No amount of &lt;code>C&lt;/code>, &lt;code>class_weight&lt;/code>, or &lt;code>tol&lt;/code> was going to change that. Once you have seen this failure mode once, you start noticing it everywhere — in customer-churn data with non-monotone relationships, in image classification before deep learning, in any regression where the trend bends. A linear algorithm has a hard ceiling, and you only break through that ceiling by changing the kind of object the algorithm operates on.&lt;/p></description></item></channel></rss>