<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Algorithm on Chen Kai Blog</title><link>https://www.chenk.top/en/categories/algorithm/</link><description>Recent content in Algorithm on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Wed, 30 Jul 2025 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/categories/algorithm/index.xml" rel="self" type="application/rss+xml"/><item><title>Reparameterization Trick &amp; Gumbel-Softmax: A Deep Dive</title><link>https://www.chenk.top/en/standalone/reparameterization-gumbel-softmax/</link><pubDate>Wed, 30 Jul 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/reparameterization-gumbel-softmax/</guid><description>&lt;p>The moment your model contains a sampling step, training hits a hard wall: &lt;strong>how do gradients flow through a random node?&lt;/strong>&lt;/p>
&lt;p>The reparameterization trick has a clean answer — rewrite &lt;span class="math-inline">$z\sim p_\theta(z)$&lt;/span>
 as &lt;span class="math-inline">$z=g_\theta(\epsilon)$&lt;/span>
, isolating the randomness in a parameter-free noise variable &lt;span class="math-inline">$\epsilon$&lt;/span>
, so backprop can flow through &lt;span class="math-inline">$g_\theta$&lt;/span>
. The trouble starts with discrete variables: operations like &lt;span class="math-inline">$\arg\max$&lt;/span>
 are not differentiable. &lt;strong>Gumbel-Softmax&lt;/strong> (a.k.a. the Concrete distribution) replaces the discrete sample with a tempered softmax over Gumbel-perturbed logits, giving you a smooth, differentiable surrogate that you can train end-to-end.&lt;/p></description></item><item><title>Low-Rank Matrix Approximation and the Pseudoinverse: From SVD to Regularization</title><link>https://www.chenk.top/en/standalone/low-rank-approximation-pseudoinverse/</link><pubDate>Mon, 28 Jul 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/low-rank-approximation-pseudoinverse/</guid><description>&lt;p>Real data matrices are almost never both square and full rank: correlated features, too few samples, and noise-induced ill-conditioning all make &amp;ldquo;matrix inverse&amp;rdquo; either undefined or numerically useless. The &lt;strong>pseudoinverse&lt;/strong> (Moore-Penrose inverse) preserves the &lt;em>spirit&lt;/em> of an inverse while dropping the impossible-to-meet requirements: it redefines the &amp;ldquo;solution&amp;rdquo; of a linear system as the &lt;strong>least-squares solution&lt;/strong>, breaking ties by picking the one with &lt;strong>minimum norm&lt;/strong>. This post derives the pseudoinverse from that least-squares viewpoint, gives the four Penrose conditions, builds it from the SVD, and connects this single object to &lt;strong>the Eckart-Young low-rank approximation theorem&lt;/strong>, &lt;strong>PCA&lt;/strong>, &lt;strong>recommender-system matrix factorization&lt;/strong>, and &lt;strong>LoRA fine-tuning&lt;/strong>.&lt;/p></description></item><item><title>Time Series Forecasting (8): Informer — Efficient Long-Sequence Forecasting</title><link>https://www.chenk.top/en/time-series/informer-long-sequence/</link><pubDate>Sun, 15 Dec 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/informer-long-sequence/</guid><description>&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/time-series/informer-long-sequence/illustration_1.png" alt="Time Series Forecasting (8): Informer — Efficient Long-Sequence Forecasting — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;p>The Transformer is wonderful at sequence modeling — right up to the moment your sequence gets long. Vanilla self-attention costs &lt;span class="math-inline">$\mathcal{O}(L^2)$&lt;/span>
 in both compute and memory, so a one-week hourly window (168 steps) is fine, a one-month window (720 steps) is painful, and a three-month window (2160 steps) is essentially impossible on a single GPU. That is exactly the regime real-world long-horizon forecasting lives in: weather, energy, finance, IoT.&lt;/p></description></item><item><title>Time Series Forecasting (7): N-BEATS — Interpretable Deep Architecture</title><link>https://www.chenk.top/en/time-series/n-beats/</link><pubDate>Sat, 30 Nov 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/n-beats/</guid><description>&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/time-series/n-beats/illustration_1.png" alt="Time Series Forecasting (7): N-BEATS — Interpretable Deep Architecture — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;p>The 2018 M4 forecasting competition served 100,000 series across six frequencies as a single benchmark. The leaderboard was dominated by hand-tuned ensembles and hybrids built from decades of statistical-forecasting craft. Shortly afterwards, a &lt;strong>pure neural network&lt;/strong> with no statistical preprocessing, no feature engineering, and no recurrence beat the competition winner on that same benchmark. That network was &lt;strong>N-BEATS&lt;/strong> by Oreshkin et al.: a stack of fully-connected blocks with two residual paths. Its interpretable variant additionally split the forecast into a polynomial trend and a Fourier seasonality, so the very thing classical statisticians wanted (a readable decomposition) came for free.&lt;/p></description></item><item><title>Time Series Forecasting (6): Temporal Convolutional Networks (TCN)</title><link>https://www.chenk.top/en/time-series/temporal-convolutional-networks/</link><pubDate>Fri, 15 Nov 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/temporal-convolutional-networks/</guid><description>&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/time-series/temporal-convolutional-networks/illustration_1.png" alt="Time Series Forecasting (6): Temporal Convolutional Networks (TCN) — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;p>For most of the 2010s, saying &amp;ldquo;deep learning for time series&amp;rdquo; meant using LSTM. The story changed in 2018 when Bai, Kolter, and Koltun published &lt;em>An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling&lt;/em>. Their result was surprisingly simple: use a stack of 1-D convolutions, make them causal (no peeking at the future), space the filter taps exponentially (dilation), wrap the whole thing in residual connections, and train. Task after task, the resulting &lt;strong>Temporal Convolutional Network&lt;/strong> (TCN) matched or beat LSTM/GRU — while training several times faster because every time step in the forward pass runs in parallel.&lt;/p></description></item><item><title>Time Series Forecasting (5): Transformer Architecture for Time Series</title><link>https://www.chenk.top/en/time-series/transformer/</link><pubDate>Thu, 31 Oct 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/transformer/</guid><description>&lt;p>The 2017 &lt;em>Attention Is All You Need&lt;/em> paper took the attention mechanism from the previous chapter to its logical extreme: &lt;strong>drop the RNN entirely&lt;/strong>. Transformers stack pure attention into a full sequence model — no recurrence, no hidden state propagating over time. Originally designed for machine translation, the architecture was quickly adapted to every other sequence task, time series included.&lt;/p>
&lt;p>Dropping a vanilla NLP Transformer onto a time-series problem runs into two immediate complications. The first is &lt;strong>position&lt;/strong>. Attention is a set operation: shuffle the input order and the output is simply shuffled the same way, with nothing in the computation recording which step came first. For a time series, order is everything: a temperature curve that goes up-then-down and one that goes down-then-up are entirely different signals. NLP solves this with sinusoidal position encodings; do those still make sense for time series, or should we use learned encodings, or just concatenate calendar features (hour-of-day, day-of-week) directly into the input?&lt;/p></description></item><item><title>Time Series Forecasting (4): Attention Mechanisms — Direct Long-Range Dependencies</title><link>https://www.chenk.top/en/time-series/attention-mechanism/</link><pubDate>Wed, 16 Oct 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/attention-mechanism/</guid><description>&lt;p>RNNs and LSTMs handled &amp;ldquo;too many time steps&amp;rdquo; but left a subtler limitation in place: information has to travel &lt;strong>step by step&lt;/strong>. For step 100 to see what happened at step 1, the signal has to ride the hidden state through 99 intermediate stops, and each stop attenuates the signal a little and squashes it through a nonlinearity. Even with LSTM&amp;rsquo;s &amp;ldquo;highway&amp;rdquo; cell state, it&amp;rsquo;s still a single lane in a single direction.&lt;/p></description></item><item><title>Time Series Forecasting (3): GRU — Lightweight Gates and Efficiency Trade-offs</title><link>https://www.chenk.top/en/time-series/gru/</link><pubDate>Tue, 01 Oct 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/gru/</guid><description>&lt;p>After you&amp;rsquo;ve used LSTM for a while, an obvious question shows up: aren&amp;rsquo;t three gates a bit much? 
The forget and input gates seem to do related work — one decides what to drop, the other decides what to add — couldn&amp;rsquo;t they be merged? And does the cell state really need to be a separate vector from the hidden state, or could the hidden state do double duty?&lt;/p>
&lt;p>That is exactly the question Cho et al. answered in 2014 with the &lt;strong>Gated Recurrent Unit&lt;/strong>. They collapsed three gates into two: an &lt;strong>update gate&lt;/strong> that controls how much of the old state to keep versus how much new content to absorb, and a &lt;strong>reset gate&lt;/strong> that decides whether to ignore the old state entirely when computing a fresh candidate. The cell state is folded back into the hidden state. The result is roughly 25% fewer parameters, training that runs 10-15% faster, and accuracy on most time-series tasks that is statistically indistinguishable from LSTM.&lt;/p></description></item><item><title>Time Series Forecasting (2): LSTM — Gate Mechanisms and Long-Term Dependencies</title><link>https://www.chenk.top/en/time-series/lstm/</link><pubDate>Mon, 16 Sep 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/lstm/</guid><description>&lt;p>The first RNN I ever trained, back in 2017, was a small sales forecaster: 50 days in, the next day out. The forward pass ran cleanly, the loss went down, and yet the model had near-total amnesia about anything older than three days. The data had a clear monthly cycle. The model couldn&amp;rsquo;t see it. I assumed I needed more data, so I added rows and layers — and watched the training loss jump to NaN halfway through epoch two.&lt;/p></description></item><item><title>Time Series Forecasting (1): Traditional Statistical Models</title><link>https://www.chenk.top/en/time-series/01-traditional-models/</link><pubDate>Sun, 01 Sep 2024 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/time-series/01-traditional-models/</guid><description>&lt;p>The first time I touched data that &amp;ldquo;looked like a time series&amp;rdquo; — hourly server CPU usage — my instinct was to throw it at a linear regression. Time on the x-axis, usage on the y-axis. The fit was terrible. 
The problem wasn&amp;rsquo;t the regression; the problem was that this kind of data has its own personality. It has trends, seasonality, and a stubborn dependence between consecutive observations. A vanilla regression treats every row as an independent sample and throws away the one piece of information that matters most: time itself.&lt;/p></description></item><item><title>Position Encoding Brief: From Sinusoidal to RoPE and ALiBi</title><link>https://www.chenk.top/en/standalone/position-encoding-brief/</link><pubDate>Fri, 30 Jun 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/position-encoding-brief/</guid><description>&lt;p>Self-attention has a strange property that surprises most people the first time they compute it by hand: it does not know the order of its inputs. Permute the tokens and every attention score is permuted along with them — the function is exactly equivariant. So before we can do anything useful with a Transformer, we have to inject position information from the outside.&lt;/p>
&lt;p>That single design decision — &lt;em>how&lt;/em> to inject it — has spawned a remarkable amount of research. Sinusoidal, learned, relative, T5-style buckets, RoPE, ALiBi, NoPE, and more. This post is a practitioner&amp;rsquo;s brief: enough math to know why each scheme works, enough comparison to choose one, and a clear focus on the property that matters most in the LLM era — &lt;strong>length extrapolation&lt;/strong>, the ability to handle sequences longer than anything seen in training.&lt;/p></description></item><item><title>Variational Autoencoder (VAE): From Intuition to Implementation and Troubleshooting</title><link>https://www.chenk.top/en/standalone/vae-guide/</link><pubDate>Tue, 27 Jun 2023 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/vae-guide/</guid><description>&lt;p>A plain autoencoder compresses and reconstructs. A variational autoencoder learns something far more useful: a smooth, structured latent space you can &lt;em>sample&lt;/em> from to generate genuinely new data. That single change — making the encoder output a &lt;em>distribution&lt;/em> instead of a vector — turns the network from a fancy compressor into a generative model with a tractable likelihood lower bound.&lt;/p>
&lt;p>This guide walks the full path: why autoencoders fail at generation, how the ELBO derivation gets you to the loss function, why the reparameterization trick is the trick that makes everything trainable, a complete PyTorch implementation, and a tour of every common failure mode with concrete fixes.&lt;/p></description></item><item><title>Optimization (12): Discrete and Global Optimization</title><link>https://www.chenk.top/en/optimization-theory/12-discrete-global-optimization/</link><pubDate>Fri, 30 Sep 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/optimization-theory/12-discrete-global-optimization/</guid><description>&lt;p>The first eleven articles in this series tackled &lt;strong>continuous convex&lt;/strong> problems (or convex relaxations of non-convex ones). This final article addresses two harder regimes:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Discrete optimization&lt;/strong>: variables take integer or combinatorial values. The feasible set is a finite (but exponentially large) collection of points. Linear and convex tools no longer apply directly because there are no derivatives across the integer lattice.&lt;/li>
&lt;li>&lt;strong>Global non-convex optimization&lt;/strong>: variables are continuous but the function has many local minima, and we want the &lt;em>global&lt;/em> one. Methods like Newton and L-BFGS only find local minima.&lt;/li>
&lt;/ul>
&lt;p>Both regimes share a key feature: &lt;strong>provably optimal algorithms are exponential&lt;/strong> in the worst case. Practical solutions come from (a) exact algorithms with smart pruning (branch-and-bound) and (b) heuristics that find good (but not optimal) solutions in polynomial time.&lt;/p></description></item><item><title>Optimization (11): Non-Convex Optimization and Saddle Escape</title><link>https://www.chenk.top/en/optimization-theory/11-nonconvex-saddle-escape/</link><pubDate>Thu, 29 Sep 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/optimization-theory/11-nonconvex-saddle-escape/</guid><description>&lt;p>For non-convex &lt;span class="math-inline">$f$&lt;/span>
, gradient descent has no global guarantee. The best we can say is that &lt;span class="math-inline">$\nabla f(x_t) \to 0$&lt;/span>
 — we converge to a stationary point, which could be a local min, a saddle, or even a local max. This article asks: when can we say more?&lt;/p>
&lt;p>Three positive results:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Saddle escape&lt;/strong>: under a &amp;ldquo;strict saddle&amp;rdquo; assumption, perturbed GD converges to local minima in polynomial time. Saddle points are unstable; Brownian noise (or just numerical perturbation) escapes them.&lt;/li>
&lt;li>&lt;strong>PL condition&lt;/strong>: a relaxation of strong convexity that holds in over-parameterized neural networks. Under PL, vanilla GD gets the linear rate &lt;span class="math-inline">$O(\log(1/\epsilon))$&lt;/span>
 even without convexity.&lt;/li>
&lt;li>&lt;strong>Loss landscape facts&lt;/strong>: for sufficiently wide neural networks, all local minima are global, and SGD&amp;rsquo;s noise gives implicit bias toward flat minima with better generalization.&lt;/li>
&lt;/ol>
&lt;p>Each is rigorous in its setting. The article also discusses what is &lt;strong>not&lt;/strong> known — there is no general theorem saying &amp;ldquo;SGD finds the global optimum of a deep network.&amp;rdquo;&lt;/p></description></item><item><title>Optimization (10): Stochastic Optimization and Variance Reduction</title><link>https://www.chenk.top/en/optimization-theory/10-stochastic-variance-reduction/</link><pubDate>Tue, 27 Sep 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/optimization-theory/10-stochastic-variance-reduction/</guid><description>&lt;p>Stochastic gradient descent samples a single component gradient per step — far cheaper than full GD, but at what cost in convergence? Can we retain the cheap per-iteration cost while recovering the fast rate of deterministic methods? This article quantifies the tradeoff from a noise-budget perspective and derives the solution.&lt;/p>
&lt;span class="math-block">$$
\min_x f(x) := \frac{1}{n} \sum_{i=1}^n f_i(x),
$$&lt;/span>
&lt;p>
For this finite-sum problem, deterministic gradient descent costs &lt;span class="math-inline">$O(n)$&lt;/span>
 per step but converges in &lt;span class="math-inline">$O(\kappa \log(1/\epsilon))$&lt;/span>
 steps. &lt;strong>Stochastic gradient descent&lt;/strong> (SGD) costs &lt;span class="math-inline">$O(1)$&lt;/span>
 per step but needs &lt;span class="math-inline">$O(1/\epsilon^2)$&lt;/span>
 steps for convex problems and &lt;span class="math-inline">$O(1/\epsilon)$&lt;/span>
 for strongly convex ones. Which is faster depends on &lt;span class="math-inline">$n$&lt;/span>
, &lt;span class="math-inline">$\kappa$&lt;/span>
, and &lt;span class="math-inline">$\epsilon$&lt;/span>
.&lt;/p></description></item><item><title>Optimization (9): Interior-Point Methods and Self-Concordant Barriers</title><link>https://www.chenk.top/en/optimization-theory/09-interior-point-barrier/</link><pubDate>Mon, 26 Sep 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/optimization-theory/09-interior-point-barrier/</guid><description>&lt;p>In 1984 Karmarkar showed that LPs could be solved in polynomial time &lt;em>practically&lt;/em> — not just theoretically (the ellipsoid method had achieved this on paper). His &lt;strong>interior-point method&lt;/strong> stayed inside the feasible polytope and converged in &lt;span class="math-inline">$O(n L)$&lt;/span>
 iterations, far better than the simplex method&amp;rsquo;s exponential worst case. Within a decade, Nesterov &amp;amp; Nemirovski generalized this to &lt;strong>all convex programming&lt;/strong> via the &lt;strong>self-concordant barrier&lt;/strong> framework. The result — &lt;span class="math-inline">$O(\sqrt{n} \log(1/\epsilon))$&lt;/span>
 Newton iterations for an &lt;span class="math-inline">$n$&lt;/span>
-dimensional problem — remains the gold standard for medium-scale convex optimization.&lt;/p></description></item><item><title>Optimization (8): Lagrangian Duality and KKT Conditions</title><link>https://www.chenk.top/en/optimization-theory/08-lagrangian-duality-kkt/</link><pubDate>Sat, 24 Sep 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/optimization-theory/08-lagrangian-duality-kkt/</guid><description>&lt;p>The most consequential idea in constrained optimization is that &lt;strong>constraints have prices&lt;/strong>. The Lagrangian transforms a constrained problem into an unconstrained one by attaching a non-negative multiplier to each inequality and a free multiplier to each equality. The resulting unconstrained problem may be easier (the SVM dual), or it may give a verifiable lower bound (the LP duality used to certify integer programs).&lt;/p>
&lt;p>This article develops:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Weak duality:&lt;/strong> the dual is always a lower bound on the primal — no assumptions needed.&lt;/li>
&lt;li>&lt;strong>Strong duality:&lt;/strong> for convex problems satisfying Slater&amp;rsquo;s condition (or with only linear constraints), the gap is zero.&lt;/li>
&lt;li>&lt;strong>KKT conditions:&lt;/strong> stationarity + primal feasibility + dual feasibility + complementary slackness, the practical optimality system.&lt;/li>
&lt;li>&lt;strong>Saddle-point characterization:&lt;/strong> the Lagrangian&amp;rsquo;s saddle point coincides with the optimal primal&amp;ndash;dual pair.&lt;/li>
&lt;/ul>
&lt;p>Each result is proved or carefully cited. We close with the SVM example, where the dual cuts the problem dimension from &lt;span class="math-inline">$d$&lt;/span>
 (number of features) to &lt;span class="math-inline">$n$&lt;/span>
 (number of training points) — the original kernel-method magic.&lt;/p></description></item><item><title>Optimization (7): Second-Order Methods</title><link>https://www.chenk.top/en/optimization-theory/07-second-order-methods/</link><pubDate>Thu, 22 Sep 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/optimization-theory/07-second-order-methods/</guid><description>&lt;p>First-order methods top out at &lt;span class="math-inline">$O(\sqrt{\kappa})$&lt;/span>
 iterations to reach &lt;span class="math-inline">$\epsilon$&lt;/span>
-accuracy (article 05). Second-order methods break this barrier by using curvature: Newton&amp;rsquo;s method has &lt;strong>quadratic&lt;/strong> local convergence — the number of correct digits doubles every iteration — and quasi-Newton methods retain most of this speed without computing the Hessian explicitly.&lt;/p>
&lt;p>The cost is in the per-iteration work: Newton solves an &lt;span class="math-inline">$n \times n$&lt;/span>
 linear system per step (&lt;span class="math-inline">$O(n^3)$&lt;/span>
), BFGS maintains an &lt;span class="math-inline">$n \times n$&lt;/span>
 matrix (&lt;span class="math-inline">$O(n^2)$&lt;/span>
 per step + &lt;span class="math-inline">$O(n^2)$&lt;/span>
 memory), and L-BFGS uses only &lt;span class="math-inline">$O(mn)$&lt;/span>
 memory for a chosen history &lt;span class="math-inline">$m$&lt;/span>
 (typically 5&amp;ndash;20).&lt;/p></description></item><item><title>Optimization (6): Composite Optimization and Proximal Methods</title><link>https://www.chenk.top/en/optimization-theory/06-composite-proximal-methods/</link><pubDate>Wed, 21 Sep 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/optimization-theory/06-composite-proximal-methods/</guid><description>&lt;p>When your objective contains a non-smooth piece (sparse regularisation, total variation, an indicator of a constraint set) or a constraint that is hard to handle directly, &amp;ldquo;just do gradient descent&amp;rdquo; stalls: there is no gradient at the kink, or every step violates feasibility. The &lt;strong>proximal operator&lt;/strong> is the engineered, beautiful workaround: think of each update as &amp;ldquo;take a step on the smooth part, then run a tiny penalised minimisation that pulls the iterate back toward a structured solution&amp;rdquo;.&lt;/p></description></item><item><title>Optimization (5): Acceleration Beyond Nesterov</title><link>https://www.chenk.top/en/optimization-theory/05-acceleration-beyond-nesterov/</link><pubDate>Tue, 20 Sep 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/optimization-theory/05-acceleration-beyond-nesterov/</guid><description>&lt;p>Article 02 introduced Nesterov acceleration and showed that it cuts the iteration count&amp;rsquo;s dependence on the condition number from &lt;span class="math-inline">$\kappa$&lt;/span>
 to &lt;span class="math-inline">$\sqrt{\kappa}$&lt;/span>
. This article asks the deeper questions:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Why &lt;span class="math-inline">$\sqrt{\kappa}$&lt;/span>
 and not faster?&lt;/strong> We prove a matching lower bound — no first-order method can do better.&lt;/li>
&lt;li>&lt;strong>Is Nesterov the only way?&lt;/strong> Polyak&amp;rsquo;s Heavy-Ball method achieves the same rate using a completely different update rule.&lt;/li>
&lt;li>&lt;strong>Can we accelerate any solver?&lt;/strong> The Catalyst framework wraps a black-box optimizer to gain the accelerated rate, at the cost of solving a regularized subproblem.&lt;/li>
&lt;/ul>
&lt;p>The unifying tool is a &lt;strong>Lyapunov potential&lt;/strong> — a non-negative quantity that the algorithm decreases at every step. Both Nesterov and Heavy-Ball admit Lyapunov proofs, and the lower bound essentially says no Lyapunov decrease can happen faster.&lt;/p></description></item><item><title>Optimization (4): Learning Rate and Schedules</title><link>https://www.chenk.top/en/optimization-theory/04-learning-rate-schedules/</link><pubDate>Sun, 18 Sep 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/optimization-theory/04-learning-rate-schedules/</guid><description>&lt;p>Your model diverges. You halve the learning rate. Now it trains, but takes forever. You halve again — now the loss is a flat line. Sound familiar? Of all the knobs you can turn, &lt;strong>learning rate&lt;/strong> is the one that most often decides whether training converges, crawls, or blows up. This guide gives you the intuition, the minimal math, and a practical workflow to get it right — from a 12-layer CNN on your laptop to a 70B-parameter LLM on a thousand GPUs.&lt;/p></description></item><item><title>Optimization (3): The Gradient Descent Family from SGD to AdamW</title><link>https://www.chenk.top/en/optimization-theory/03-gradient-descent-family/</link><pubDate>Fri, 16 Sep 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/optimization-theory/03-gradient-descent-family/</guid><description>&lt;p>Why is &amp;ldquo;tuning the LR is an art&amp;rdquo; a meme for ResNet, while every modern LLM paper just writes &amp;ldquo;AdamW, &lt;span class="math-inline">$\beta_1{=}0.9, \beta_2{=}0.95, \mathrm{wd}{=}0.1$&lt;/span>
&amp;rdquo; and moves on? It is not an accident — it is the &lt;strong>end-point of three decades of optimizer evolution&lt;/strong>.&lt;/p>
&lt;p>This post walks the lineage end-to-end on a single thread: each step exists because of a &lt;strong>specific failure&lt;/strong> of the previous one. We end with the three directions that have actually entered the post-2023 large-model toolkit: Lion, Sophia, and Schedule-Free.&lt;/p></description></item><item><title>Optimization (2): Smoothness, Strong Convexity, and Nesterov Acceleration</title><link>https://www.chenk.top/en/optimization-theory/02-smoothness-strong-convexity-nesterov/</link><pubDate>Thu, 15 Sep 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/optimization-theory/02-smoothness-strong-convexity-nesterov/</guid><description>&lt;p>A surprising amount of &amp;ldquo;optimizer folklore&amp;rdquo; collapses into three concepts:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>How fast can the gradient change?&lt;/strong> Lipschitz smoothness (&lt;span class="math-inline">$L$&lt;/span>
-smoothness) caps the step size.&lt;/li>
&lt;li>&lt;strong>How sharp is the bottom?&lt;/strong> &lt;span class="math-inline">$\mu$&lt;/span>
-strong convexity sets the convergence rate and forces the minimizer to be unique.&lt;/li>
&lt;li>&lt;strong>Can we get there faster without losing stability?&lt;/strong> Nesterov acceleration and adaptive restart cut the iteration count&amp;rsquo;s dependence on the condition number from &lt;span class="math-inline">$\kappa$&lt;/span>
 to &lt;span class="math-inline">$\sqrt{\kappa}$&lt;/span>
.&lt;/li>
&lt;/ul>
&lt;p>This post lays them out on a single thread: nail the geometric intuition with the minimum number of inequalities, prove the key theorems, then close with a least-squares experiment that pits GD, Heavy Ball, and Nesterov against each other. The goal is not to stack formulas — it is to make you able to look at a new problem and instantly answer &amp;ldquo;what step size, what rate, is acceleration worth it?&amp;rdquo;&lt;/p></description></item><item><title>Optimization (1): Convex Analysis Foundations</title><link>https://www.chenk.top/en/optimization-theory/01-convex-analysis-foundations/</link><pubDate>Wed, 14 Sep 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/optimization-theory/01-convex-analysis-foundations/</guid><description>&lt;p>This article is the foundation the rest of the series is built on. Almost every result we will prove later — convergence rates of gradient descent, Lagrangian duality, the proximal operator, even the analysis of stochastic methods — relies on a small set of facts about convex sets and convex functions. We will derive all of them from scratch.&lt;/p>
&lt;p>If you only remember three things from this article, make it these:&lt;/p></description></item><item><title>Kernel Methods (8): Deep Kernel Learning vs Deep Learning — A Practitioner's Guide</title><link>https://www.chenk.top/en/kernel-methods/08-deep-kernels-vs-dl/</link><pubDate>Thu, 30 Dec 2021 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/kernel-methods/08-deep-kernels-vs-dl/</guid><description>&lt;p>In 2026, why are you still reading about kernel methods? Aren&amp;rsquo;t transformers supposed to have eaten the entire ML stack? Yes and no. Transformers eat the headlines, but kernels still eat the corners — the regimes with 200 samples, the regimes where the model has to publish calibrated error bars, the regimes where a physicist needs to know &lt;em>which&lt;/em> basis function caused the prediction. This final part is the field manual: when kernels actually win, how to debug them when they don&amp;rsquo;t, and how to bolt them on top of a neural network when you want the best of both worlds.&lt;/p></description></item><item><title>Kernel Methods (7): Large-Scale Kernels — Nystrom Approximation and Random Fourier Features</title><link>https://www.chenk.top/en/kernel-methods/07-large-scale-kernels/</link><pubDate>Fri, 24 Dec 2021 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/kernel-methods/07-large-scale-kernels/</guid><description>&lt;p>You want to train an RBF SVM on a million-image classification set. The Gram matrix is &lt;span class="math-inline">$10^6 \times 10^6$&lt;/span>
 doubles, which is &lt;strong>8 TB&lt;/strong>. That number alone — eight terabytes of RAM, just to &lt;em>store&lt;/em> the kernel — is why most working data scientists who learned kernel methods in a stats class quietly never reach for them on real production workloads. The kernel trick gives you an infinite-dimensional feature space for the cost of one dot product per pair; the bill arrives when you have &lt;span class="math-inline">$n^2$&lt;/span>
 pairs.&lt;/p></description></item><item><title>Kernel Methods (6): Gaussian Processes — When Kernels Meet Bayesian Inference</title><link>https://www.chenk.top/en/kernel-methods/06-gaussian-processes/</link><pubDate>Sun, 19 Dec 2021 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/kernel-methods/06-gaussian-processes/</guid><description>&lt;p>Kernel ridge regression gives you a number. You feed it &lt;span class="math-inline">$x_*$&lt;/span>
, it returns &lt;span class="math-inline">$\hat{y}_* = 23.7$&lt;/span>
. End of story. But you wanted to &lt;em>act&lt;/em> on that prediction — maybe schedule a delivery, dose a patient, place a bet — and the single number is not enough. Tomorrow&amp;rsquo;s temperature being &amp;ldquo;25°C&amp;rdquo; is useful; &amp;ldquo;very likely 25°C, 95% chance between 22 and 28&amp;rdquo; is &lt;em>actionable&lt;/em>. Every decision under uncertainty needs the second one. Gaussian Processes are the cleanest way to upgrade a kernel method from &amp;ldquo;point predictor&amp;rdquo; to &amp;ldquo;distribution predictor&amp;rdquo;, and they do it without abandoning a single line of the kernel math from the previous five parts.&lt;/p></description></item><item><title>Kernel Methods (5): Kernel SVM, Kernel PCA, and Kernel Ridge Regression</title><link>https://www.chenk.top/en/kernel-methods/05-kernel-algorithms/</link><pubDate>Tue, 14 Dec 2021 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/kernel-methods/05-kernel-algorithms/</guid><description>&lt;p>Your features are two-dimensional, your data is clearly a circle inside a circle, and &lt;code>LinearSVC&lt;/code> is at 50% accuracy with the wide-eyed look of an algorithm that genuinely believes a straight line is the answer. You stare at the scatter plot, you stare at the model, and somewhere in the back of your head the words &lt;em>kernel SVM&lt;/em> surface. 
You type &lt;code>kernel='rbf'&lt;/code>, the accuracy jumps to 0.98, and the rest of the afternoon you wonder what exactly just happened — and why the same trick also gives you a Kernel PCA that unfolds a Swiss roll and a Kernel Ridge regressor that fits a sine wave with three lines of code.&lt;/p></description></item><item><title>Kernel Methods (4): Common Kernel Families — RBF, Matern, Polynomial, Periodic, and More</title><link>https://www.chenk.top/en/kernel-methods/04-common-kernels/</link><pubDate>Thu, 09 Dec 2021 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/kernel-methods/04-common-kernels/</guid><description>&lt;p>You type &lt;code>SVC(kernel='rbf')&lt;/code> in scikit-learn for the first time. What did you set &lt;code>gamma&lt;/code> to? &lt;code>'scale'&lt;/code>? &lt;code>'auto'&lt;/code>? You scrolled past those defaults without thinking. Three months later your model is overfitting, your Gram matrix looks like the identity, and you have no idea which knob is wrong. Most &amp;ldquo;kernel tuning&amp;rdquo; debt is really &lt;em>kernel choice&lt;/em> debt — you picked the default kernel for the wrong reason, and now no amount of grid search will save you.&lt;/p></description></item><item><title>Kernel Methods (3): RKHS — The Theoretical Soul of Kernel Methods</title><link>https://www.chenk.top/en/kernel-methods/03-rkhs/</link><pubDate>Sat, 04 Dec 2021 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/kernel-methods/03-rkhs/</guid><description>&lt;p>If your eyes glaze over the moment a lecturer writes &amp;ldquo;RKHS&amp;rdquo; on the board, this part of the series is for you. RKHS is not a club of three intimidating letters — it is a function space, and once you see what lives inside it, kernel methods stop feeling like magic and start feeling like linear algebra you already know.&lt;/p>
&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/kernel-methods/03-rkhs/fig1_hilbert_space_concept.png" alt="A Hilbert-space cover for Part 3 of the kernel-methods series" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p></description></item><item><title>Kernel Methods (2): Mathematical Foundations — Positive-Definite Kernels and Mercer's Theorem</title><link>https://www.chenk.top/en/kernel-methods/02-kernel-math-foundations/</link><pubDate>Mon, 29 Nov 2021 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/kernel-methods/02-kernel-math-foundations/</guid><description>&lt;p>A week into kernel-SVM hacking I wrote what felt like a perfectly reasonable similarity function — &lt;code>tanh(1.5 * x.dot(y) - 2.0)&lt;/code>. It compiled, it ran, the math looked symmetric. Then sklearn coughed up &lt;code>ValueError: kernel matrix is not positive semidefinite&lt;/code> and the optimiser produced a model that was &lt;em>worse&lt;/em> than guessing.&lt;/p>
&lt;p>That error message turned out to hide one of the deepest results in 20th-century analysis. &amp;ldquo;Positive-definite&amp;rdquo; is not a checkbox — it is the entire reason the kernel trick is allowed to exist. If your function is PSD, there exists a Hilbert space where it is a real inner product; if it is not, you are pretending to live in a space that nobody built. This post unpacks that statement, builds the operational tests, derives Mercer&amp;rsquo;s theorem, and works through enough numerical examples that the next time you see the failure message you will know exactly which line of math your kernel violated.&lt;/p></description></item><item><title>Kernel Methods (1): Why We Need Them — Hitting the Ceiling of Linear Algorithms</title><link>https://www.chenk.top/en/kernel-methods/01-why-kernels/</link><pubDate>Wed, 24 Nov 2021 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/kernel-methods/01-why-kernels/</guid><description>&lt;p>The first time I tried to fit a logistic regression to a dataset of two interlocking spirals, I burned an afternoon tweaking the regularizer, swapping solvers, and rescaling features — convinced that I was doing something wrong. The accuracy hovered around 50%. That is the noise floor of a coin flip; my model was, in a very literal sense, learning nothing.&lt;/p>
&lt;p>The model was not buggy. The data was simply not the kind of object a straight line can describe. No amount of tuning &lt;code>C&lt;/code>, &lt;code>class_weight&lt;/code>, or &lt;code>tol&lt;/code> was going to change that. After you have seen this failure mode once, you start noticing it everywhere: in customer-churn data with non-monotone relationships, in image classification before deep learning, in any regression where the trend bends. A linear algorithm has a hard ceiling, and you only break through that ceiling by changing the kind of object the algorithm operates on.&lt;/p></description></item></channel></rss>