<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Stochastic Methods on Chen Kai Blog</title><link>https://www.chenk.top/en/tags/stochastic-methods/</link><description>Recent content in Stochastic Methods on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Tue, 27 Sep 2022 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/tags/stochastic-methods/index.xml" rel="self" type="application/rss+xml"/><item><title>Optimization (10): Stochastic Optimization and Variance Reduction</title><link>https://www.chenk.top/en/optimization-theory/10-stochastic-variance-reduction/</link><pubDate>Tue, 27 Sep 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/optimization-theory/10-stochastic-variance-reduction/</guid><description>&lt;p>Stochastic gradient descent samples a single component gradient per step — far cheaper than full GD, but at what cost in convergence? Can we retain the cheap per-iteration cost while recovering the fast rate of deterministic methods? This article quantifies the tradeoff from a noise-budget perspective and derives the solution.&lt;/p>
&lt;p>For the finite-sum problem&lt;/p>
&lt;span class="math-block">$$
\min_x f(x) := \frac{1}{n} \sum_{i=1}^n f_i(x),
$$&lt;/span>
&lt;p>
deterministic gradient descent costs &lt;span class="math-inline">$O(n)$&lt;/span>
 per step but converges in &lt;span class="math-inline">$O(\kappa \log(1/\epsilon))$&lt;/span>
 steps. &lt;strong>Stochastic gradient descent&lt;/strong> (SGD) costs &lt;span class="math-inline">$O(1)$&lt;/span>
 per step but needs &lt;span class="math-inline">$O(1/\epsilon^2)$&lt;/span>
 steps for convex problems and &lt;span class="math-inline">$O(1/\epsilon)$&lt;/span>
 steps for strongly convex ones, a sublinear rate because the gradient noise does not vanish at the optimum. Which is faster depends on &lt;span class="math-inline">$n$&lt;/span>
, &lt;span class="math-inline">$\kappa$&lt;/span>
, and &lt;span class="math-inline">$\epsilon$&lt;/span>
.&lt;/p></description></item></channel></rss>