Optimization (10): Stochastic Optimization and Variance Reduction

Tue, 27 Sep 2022 09:00:00 +0000

Stochastic gradient descent samples a single component gradient per step, which is far cheaper than a full gradient descent (GD) step, but what does that cost in convergence? Can we keep the cheap per-iteration cost while recovering the fast rate of deterministic methods? This article quantifies the tradeoff from a noise-budget perspective and derives variance reduction as the answer.

For the finite-sum problem

$$\min_x f(x) := \frac{1}{n} \sum_{i=1}^n f_i(x),$$

deterministic gradient descent costs $O(n)$ per step but converges in $O(\kappa \log(1/\epsilon))$ steps when $f$ is strongly convex with condition number $\kappa$. Stochastic gradient descent (SGD) costs $O(1)$ per step but needs $O(1/\epsilon^2)$ steps for convex problems and $O(1/\epsilon)$ steps for strongly convex ones, so its rate is only sublinear. Which is faster depends on $n$, $\kappa$, and the target accuracy $\epsilon$.
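To make the per-step cost concrete, here is a minimal sketch in plain Python. It assumes least-squares components $f_i(x) = \tfrac{1}{2}(a_i^\top x - b_i)^2$, which the article does not specify, and every function name is illustrative. It contrasts the full gradient, which touches all $n$ components, with a single-sample stochastic gradient, and measures the sampling noise that the cheaper estimate carries.

```python
import random

def component_grad(x, a_i, b_i):
    """Gradient of the assumed component f_i(x) = 0.5 * (a_i . x - b_i)^2."""
    residual = sum(a * xj for a, xj in zip(a_i, x)) - b_i
    return [residual * a for a in a_i]

def full_grad(x, A, b):
    """Gradient of f(x) = (1/n) * sum_i f_i(x): n component evaluations."""
    n = len(A)
    total = [0.0] * len(x)
    for a_i, b_i in zip(A, b):
        g = component_grad(x, a_i, b_i)
        total = [t + gj for t, gj in zip(total, g)]
    return [t / n for t in total]

def stochastic_grad(x, A, b, rng):
    """One component evaluation at a uniformly sampled index i.

    Its expectation over i equals full_grad(x, A, b), so it is unbiased.
    """
    i = rng.randrange(len(A))
    return component_grad(x, A[i], b[i])

def gradient_noise(x, A, b):
    """E_i ||grad f_i(x) - grad f(x)||^2 under uniform sampling of i."""
    g = full_grad(x, A, b)
    n = len(A)
    return sum(
        sum((gi - gj) ** 2 for gi, gj in zip(component_grad(x, a_i, b_i), g))
        for a_i, b_i in zip(A, b)
    ) / n

# Synthetic data, purely for illustration.
rng = random.Random(0)
n, d = 200, 3
A = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
x_true = [1.0, -2.0, 0.5]
b = [sum(a * xj for a, xj in zip(a_i, x_true)) + 0.1 * rng.gauss(0, 1)
     for a_i in A]

x = [0.0] * d
g_full = full_grad(x, A, b)          # costs n component gradients
g_one = stochastic_grad(x, A, b, rng)  # costs 1 component gradient
print("full gradient:      ", g_full)
print("stochastic gradient:", g_one)
print("sampling noise:     ", gradient_noise(x, A, b))
```

The stochastic estimate is correct on average but has nonzero variance at a generic point. That variance is what slows SGD down, and it is the quantity variance-reduction methods aim to shrink.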

Stochastic Methods on Chen Kai Blog
