
Optimization (2): Smoothness, Strong Convexity, and Nesterov Acceleration
Three concepts that demystify most of optimization: Lipschitz smoothness fixes the maximum step size, strong convexity sets the convergence rate and uniqueness of the minimizer, and Nesterov acceleration replaces kappa with sqrt(kappa) without sacrificing stability. Includes the key theorems with proofs and a least-squares experiment.
A surprising amount of “optimizer folklore” collapses into three concepts:
- How steep is the gradient? Lipschitz smoothness ($L$ -smoothness) caps the step size.
- How sharp is the bottom? $\mu$ -strong convexity sets the convergence rate and forces the minimizer to be unique.
- Can we get there faster without losing stability? Nesterov acceleration and adaptive restart turn the per-condition-number cost from $\kappa$ into $\sqrt{\kappa}$ .
This post lays them out on a single thread: nail the geometric intuition with the minimum number of inequalities, prove the key theorems, then close with a least-squares experiment that pits GD, Heavy Ball, and Nesterov against each other. The goal is not to stack formulas — it is to make you able to look at a new problem and instantly answer “what step size, what rate, is acceleration worth it?”
Anyone who has tuned a deep network knows the loop: bump the learning rate up a bit, the loss explodes; drop it, training crawls. Underneath that frustration there are really only three quantities in play:
- How steep is the gradient? This is Lipschitz smoothness $L$ — it caps how big a step you are allowed to take.
- How sharp is the bottom? This is the strong-convexity constant $\mu$ — it decides whether you converge linearly or just trudge along.
- Can we go faster without losing stability? Nesterov acceleration is the answer, and it cuts the per-condition-number cost from $\kappa$ down to $\sqrt{\kappa}$ .
What I want to do here is concrete: build a picture-level intuition for $L$ and $\mu$ first, then open up the magic-looking $\beta_t = (t-1)/(t+2)$ in Nesterov’s method so you see it isn’t pulled out of a hat. Once you finish, your reflex on a new problem should be: estimate $L$ and $\mu$ first, then decide whether acceleration is worth its complexity tax.
What You Will Learn#
- The geometric meaning of Lipschitz continuity: every point sits inside a slope cone that contains the function graph.
- The equivalent characterisation of $L$ -smoothness: the function is sandwiched from above by a family of quadratics (descent lemma).
- Strong convexity as a quadratic lower bound, with the existence and uniqueness of the minimizer as a free byproduct.
- Why the condition number $\kappa = L/\mu$ controls GD’s iteration complexity, and how Nesterov replaces it with $\sqrt{\kappa}$ .
- Why adaptive restart is still needed under strong convexity, demonstrated on least squares.
Prerequisites#
- Multivariate calculus (gradients, Hessians, Taylor expansion).
- Basic convex analysis (convex set, first-order condition).
Lipschitz continuity and gradient smoothness#
The geometric picture of Lipschitz continuity#
$$|f(y) - f(x)| \le L\,\|y - x\|.$$Geometrically, this is a two-sided cone constraint: at any anchor $(x_0, f(x_0))$ , the double cone of slope $\pm L$ must enclose the entire function graph. As soon as the function develops a near-vertical tangent (e.g. $\sqrt{|x|}$ near the origin), no finite $L$ can contain it and $f$ stops being Lipschitz.

Left: $\sin x$ is 1-Lipschitz; the orange cones at three sample anchors fully contain the graph. Right: $\mathrm{sign}(x)\sqrt{|x|}$ has a divergent slope at the origin; the red region is where the candidate $L=2$ cone is pierced.
Two corollaries that will keep coming back:
- Uniform continuity: take $\delta = \varepsilon / L$ .
- Closure: Lipschitz functions are closed under addition, scalar multiplication, and composition (constants multiply). This lets us decompose complicated models into Lipschitz building blocks with known constants.
Gradient-Lipschitz = $L$ -smoothness#
$$\|\nabla f(y) - \nabla f(x)\| \le L\,\|y - x\|.$$ $$\boxed{\,f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \frac{L}{2}\,\|y - x\|^2.\,}$$That is: $f$ is sandwiched from above by a family of upward parabolas with curvature $L$ . The tangent line is the worst-case lower envelope; the parabola is the actual upper envelope.

The blue curve is $f(x) = \tfrac{1}{2}\sin(2x) + \tfrac{1}{2}x^2$ . At each of the three anchor points the dashed quadratic $f(x_0) + f'(x_0)(x-x_0) + \tfrac{L}{2}(x-x_0)^2$ stays above $f$ globally, while the dotted tangent line only stays below where $f$ is locally convex.
$$f(y) \le f(x) - \eta\Big(1 - \frac{L\eta}{2}\Big)\|\nabla f(x)\|^2.$$As long as $\eta \le 1/L$ , the bracket is $\ge 1/2$ , so every step strictly decreases $f$ , with the decrease lower-bounded by a constant times $\|\nabla f\|^2$ . This is the real origin of “step size at most $1/L$ ” — not folklore, but a direct corollary of the descent lemma.

Three regimes on the same quadratic $f(x) = \tfrac{L}{2}x^2$ (here $L = 4$ ). Left: a small step ($\eta = 0.8/L$ ) creeps down monotonically. Middle: the canonical $\eta = 1/L$ lands at the optimum in one move from $x_0 = 1$ — fastest possible without overshoot. Right: cross the $2/L$ ceiling and the iterates blow up, the descent lemma broken. The tight band $[1/L,\, 2/L]$ is exactly where convergence is fastest and still safe; below it you waste iterations, above it you diverge.
Three worked examples#
| Function | Gradient | Spectral norm of Hessian | $L$ |
|---|---|---|---|
| $\tfrac{1}{2}\mid x\mid^2$ | $x$ | $1$ | $1$ |
| $\tfrac{1}{2}\mid Ax-b\mid^2$ | $A^\top(Ax-b)$ | $\lambda_{\max}(A^\top A)$ | $\lambda_{\max}(A^\top A)$ |
| Logistic $\log(1 + e^{-y\,\theta^\top x})$ (one sample) | $-y\,\sigma(-y\theta^\top x)\,x$ | $\sigma(\cdot)\sigma(-\cdot)\,xx^\top$ | $\tfrac{1}{4}\mid x\mid^2$ |
The third row gives the standard $L = \tfrac{1}{4n}\sum_i\|x_i\|^2$ for logistic regression — the bound $\sigma(\cdot)\sigma(-\cdot)\le 1/4$ does the work.
Hessian spectral bound implies gradient Lipschitz#
Theorem 1 (Hessian spectral norm $\Rightarrow$ Lipschitz gradient). If $f$ is twice differentiable with $\sup_x \|\nabla^2 f(x)\|_2 \le L$ , then $\nabla f$ is $L$ -Lipschitz.
$$\nabla f(y) - \nabla f(x) = \int_0^1 \nabla^2 f(x + t(y-x))\,(y-x)\,\mathrm dt.$$ $$\|\nabla f(y) - \nabla f(x)\| \le \int_0^1 \|\nabla^2 f(\cdot)\|_2\,\|y-x\|\,\mathrm dt \le L\,\|y-x\|.\quad\blacksquare$$For a quadratic $f(x) = \tfrac{1}{2}x^\top H x$ this gives $L = \lambda_{\max}(H)$ exactly — the same $L$ that the descent lemma uses, so the step rule $\eta \le 1/\lambda_{\max}(H)$ is tight.
Strong convexity: existence, uniqueness, quadratic growth#
Three equivalent definitions#
$$f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle + \frac{\mu}{2}\,\|y - x\|^2.$$It has three equivalent forms, each handier in a different context:
- Quadratic lower bound (the inequality above): $f$ sits above a family of upward parabolas with curvature $\mu$ .
- $f - \tfrac{\mu}{2}\|x\|^2$ is convex: subtract off the “$\mu$ -curvature mass” and you are left with a convex function.
- Second-order condition (if $f$ is $C^2$ ): $\nabla^2 f(x) \succeq \mu I$ .

Left: a strongly convex function with three quadratic lower models (one per anchor). Right: $f(x) = 0.05\,x^4$ is convex but not strongly convex — it is too flat at the origin for any $\mu>0$ parabola to fit underneath. This is exactly why we cannot get linear convergence on this kind of objective.
Existence and uniqueness#
Theorem 2 (existence). If $f$ is lower semi-continuous and some sublevel set $\{x : f(x)\le\alpha\}$ is non-empty and bounded, $f$ attains its minimum on that set.
This is just Weierstrass: closed + bounded = compact, and lsc functions attain their infimum on compact sets.
Theorem 3 (uniqueness from strong convexity). A $\mu$ -strongly convex function ($\mu > 0$ ) on $\mathbb{R}^n$ has at most one global minimizer.
$$f(y^\star) \ge f(x^\star) + 0 + \frac{\mu}{2}\|y^\star - x^\star\|^2.$$But $f(y^\star) = f(x^\star)$ , so $\|y^\star - x^\star\| = 0$ . $\blacksquare$
$$f(y) - f^\star \ge \frac{\mu}{2}\|y - x^\star\|^2,$$i.e. the cost grows at least quadratically away from the minimizer. This is the lever that turns “small function value” into “small distance to optimum” in every convergence proof below.
$L$ -smooth and $\mu$ -strongly convex together: the condition number#
$$\mu I \preceq \nabla^2 f(x) \preceq L I,$$ $$\boxed{\,\kappa := \frac{L}{\mu} \ge 1\,}$$is the condition number, and it controls every rate that follows. Large $\kappa$ means the function is “long and narrow” — the steepest direction is on the verge of instability while the flattest direction barely contributes any signal. That is the structural source of optimization difficulty.
Accelerated gradient descent: from $\kappa$ to $\sqrt{\kappa}$ #
Two upper bounds for plain GD#
$$f(x_t) - f^\star \le \frac{L\,\|x_0 - x^\star\|^2}{2t} = \mathcal O(1/t).$$ $$\|x_t - x^\star\|^2 \le \Big(1 - \frac{1}{\kappa}\Big)^t \|x_0 - x^\star\|^2.$$Reaching error $\varepsilon$ takes $t = \mathcal O(\kappa\log(1/\varepsilon))$ . The condition number enters linearly: on a least-squares problem with $\kappa = 10^4$ , every additional factor of 10 in precision costs about another $2\times 10^4$ iterations.
Nesterov: lookahead trades $\kappa$ for $\sqrt{\kappa}$ #
$$x_{t+1} = x_t - \alpha\nabla f(x_t) + \beta(x_t - x_{t-1}).$$ $$ \begin{aligned} y_t &= x_t + \beta_t (x_t - x_{t-1}), \\ x_{t+1} &= y_t - \eta\,\nabla f(y_t). \end{aligned} $$Intuition: take a momentum-extrapolated peek at where you are about to land, then correct using the gradient there. That bit of foresight is enough to preserve acceleration on every $L$ -smooth convex function.
$$f(x_t) - f^\star \le \frac{2L\,\|x_0 - x^\star\|^2}{(t+1)^2}.$$ $$\beta = \frac{1 - \sqrt{1/\kappa}}{1 + \sqrt{1/\kappa}},$$we get $f(x_t) - f^\star \le \big(1 - \sqrt{1/\kappa}\big)^t (f(x_0) - f^\star)$ , hence $t = \mathcal O(\sqrt{\kappa}\log(1/\varepsilon))$ .

Left: 80 iterations on a rotated quadratic with $\kappa = 30$ . GD (blue, $\eta = 1.9/L$ to amplify the effect) overshoots in the steep direction and zigzags across the narrow valley, making slow net progress along the flat axis. Nesterov (purple, $\eta = 1/L$ ) builds up momentum along the valley and barely excites the steep mode at all. Right: distance to $x^\star$ on a log scale — the slope difference is the geometric realisation of the $\kappa$ vs $\sqrt{\kappa}$ rate gap that Theorems 5 and 7 predict.

Left ($\kappa=100$ ): on a log axis all three algorithms decrease linearly, but Nesterov (purple) and Heavy Ball (orange) have a steeper slope than GD (blue) — that is the $\sqrt{\kappa}$ vs $\kappa$ gap made visible. The dashed lines are the theoretical envelopes; they match the empirical traces. Right: zoom in on the first 80 iterations and Nesterov is non-monotone — it oscillates inside the lower envelope. That is the price of acceleration.
Adaptive restart fixes the side effects of acceleration#
Acceleration’s downside is non-monotonicity — the function value bumps up periodically. This becomes painful in two situations:
- $\mu$ unknown: Theorem 7 needs $\sqrt{1/\kappa}$ in the momentum; misestimating it kills the rate.
- Locally strongly convex: $f$ is only globally convex but strongly convex near $x^\star$ ; the convex-rate Nesterov underperforms.
Adaptive restart (O’Donoghue & Candès, 2015): whenever the gradient direction reverses ($\langle\nabla f(y_t), x_{t+1}-x_t\rangle > 0$ ) or the function value rises, reset the momentum and the iteration counter $t \leftarrow 1$ .
Theorem 8 (restart Nesterov hits the optimal rate without knowing $\mu$ ). For $\mu$ -strongly convex + $L$ -smooth $f$ , restart Nesterov still achieves $\mathcal O(\sqrt{\kappa}\log(1/\varepsilon))$ iterations, without needing $\mu$ .
Proof sketch: between two restarts, you run convex-rate Nesterov, so the gap drops as $\mathcal O(1/k^2)$ . Combined with the quadratic-growth corollary, this gives “each restart at least halves the gap.” A geometric halving $\log(1/\varepsilon)$ times reaches $\varepsilon$ . Each segment has length $\sim \sqrt{\kappa}$ , so total iterations are $\sim \sqrt{\kappa}\log(1/\varepsilon)$ .
How much does $\kappa$ actually matter#

Left (log-log): on synthetic least squares, the iterations to reach $10^{-6}$ relative gap grow strictly along $\kappa$ for GD and along $\sqrt{\kappa}$ for Nesterov. Right: the speedup ratio $T_{\text{GD}}/T_{\text{AGD}}$ tracks $\sqrt{\kappa}$ almost exactly. At $\kappa = 10^4$ the speedup is ~100x; at $\kappa = 100$ it is ~10x. The worse the conditioning, the more acceleration pays off.
A decision table#
| Situation | Recommendation | Why |
|---|---|---|
| Small $\kappa$ ($\le 50$ ), keep code simple | GD | Acceleration’s complexity payoff is small |
| Large $\kappa$ , $\mu$ known | Constant-momentum Nesterov (Thm 7) | Closed-form optimal $\beta$ |
| Large $\kappa$ , $\mu$ unknown / locally s.c. | Adaptive-restart Nesterov | Adapts to $\mu$ while keeping the optimal rate |
| Strictly convex quadratic (least squares) | Conjugate Gradient first | Finite termination in $\le n$ steps, beats Nesterov |
| Non-convex but locally near-convex (deep nets) | Momentum / Adam with warmup + cosine | Theoretical rates no longer provable; engineering rules dominate |
The least-squares experiment#
Problem and exact constants#
$$\min_{x\in\mathbb{R}^n} f(x) = \frac{1}{2}\|Ax - b\|^2, \qquad A\in\mathbb{R}^{m\times n},\; b\in\mathbb{R}^m.$$ $$\nabla f(x) = A^\top(Ax - b), \qquad \nabla^2 f(x) = A^\top A,$$ $$L = \lambda_{\max}(A^\top A), \qquad \mu = \lambda_{\min}(A^\top A), \qquad \kappa = \kappa(A^\top A) = \kappa(A)^2.$$Critical caveat. The condition number is the square of $A$ ’s condition number. So an $A$ that “looks fine” with $\kappa(A) = 100$ becomes a hard $\kappa = 10^4$ instance for least squares — and for the normal-equation gradient, no less.
Implementation#
| |
What we observe#
On a synthetic instance with $m = 200, n = 100$ , $\kappa(A) \approx 100$ (so $\kappa(A^\top A) \approx 10^4$ ):
- GD: needs ~$4\times 10^4$ iterations to reach $10^{-6}$ relative gradient norm; smooth but slow.
- Nesterov (constant momentum): ~400 iterations; the trace oscillates inside the trough as Figure 4 (right) predicts.
- Restart Nesterov: ~500 iterations and almost perfectly monotone — the most robust of the three.
The empirical speedup matches the $\sqrt{\kappa}$ prediction from Figure 5 ($10^4 \to 100\times$ ).
Summary and where to go next#
The cheat sheet#
| Assumption | Algorithm | Rate | Step size |
|---|---|---|---|
| $L$ -smooth, convex | GD | $\mathcal O(1/t)$ | $\eta = 1/L$ |
| $L$ -smooth, $\mu$ -strongly convex | GD | $\big(1 - 1/\kappa\big)^t$ | $\eta = 1/L$ |
| $L$ -smooth, convex | Nesterov | $\mathcal O(1/t^2)$ | $\eta = 1/L,\ \beta_t = (t-1)/(t+2)$ |
| $L$ -smooth, $\mu$ -strongly convex | Nesterov | $\big(1 - 1/\sqrt{\kappa}\big)^t$ | $\eta = 1/L,\ \beta = (1-\sqrt{1/\kappa})/(1+\sqrt{1/\kappa})$ |
| Same, $\mu$ unknown | Restart Nesterov | $\big(1 - 1/\sqrt{\kappa}\big)^t$ | adaptive |
Three reflexes for any new problem#
When a fresh objective lands on your desk, walk this loop:
- How steep? Estimate $L$ (largest eigenvalue or backtracking line search). It immediately gives you the step ceiling.
- How sharp at the bottom? Estimate $\mu$ (smallest eigenvalue or a PL constant). It tells you whether linear convergence is on the table.
- Is acceleration worth it? Look at $\kappa = L/\mu$ . $\kappa < 50$ : GD is fine. $\kappa \in [50, 10^4]$ : switch to Nesterov. $\kappa > 10^4$ : think about preconditioning or a second-order method.
Where the story continues#
- Non-convex + PL condition. Drop strong convexity; if $\tfrac{1}{2}\|\nabla f\|^2 \ge \mu(f - f^\star)$ holds, GD still converges linearly. This is the theoretical seed of “why over-parameterised deep nets train linearly” (Karimi et al., 2016).
- Acceleration with noise. Vanilla Nesterov is not robust to stochastic gradients. SAG / SVRG / Katyusha pull the strongly-convex stochastic rate back to the $\sqrt{\kappa}$ regime through variance reduction.
- Second-order acceleration. Sophia and Shampoo precondition by a (block-)diagonal Hessian, effectively rewriting the condition number — an active 2024 frontier in large-scale pretraining.
References#
- Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate $\mathcal O(1/k^2)$ . Soviet Mathematics Doklady, 27(2), 372–376.
- Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
- Bubeck, S. (2015). Convex Optimization: Algorithms and Complexity. Foundations and Trends in Machine Learning, 8(3-4), 231–357.
- O’Donoghue, B., & Candès, E. (2015). Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3), 715–732.
- Karimi, H., Nutini, J., & Schmidt, M. (2016). Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. ECML-PKDD.
Optimization Theory 12 parts
- 01 Optimization (1): Convex Analysis Foundations
- 02 Optimization (2): Smoothness, Strong Convexity, and Nesterov Acceleration you are here
- 03 Optimization (3): The Gradient Descent Family from SGD to AdamW
- 04 Optimization (4): Learning Rate and Schedules
- 05 Optimization (5): Acceleration Beyond Nesterov
- 06 Optimization (6): Composite Optimization and Proximal Methods
- 07 Optimization (7): Second-Order Methods
- 08 Optimization (8): Lagrangian Duality and KKT Conditions
- 09 Optimization (9): Interior-Point Methods and Self-Concordant Barriers
- 10 Optimization (10): Stochastic Optimization and Variance Reduction
- 11 Optimization (11): Non-Convex Optimization and Saddle Escape
- 12 Optimization (12): Discrete and Global Optimization