Optimization (2): Smoothness, Strong Convexity, and Nesterov Acceleration

A surprising amount of “optimizer folklore” collapses into three concepts:

How steep is the gradient? Lipschitz smoothness ( $$L$$ -smoothness) caps the step size.
How sharp is the bottom? $\mu$ -strong convexity sets the convergence rate and forces the minimizer to be unique.
Can we get there faster without losing stability? Nesterov acceleration and adaptive restart turn the per-condition-number cost from $\kappa$ into $\sqrt{\kappa}$ .

This post lays them out on a single thread: nail the geometric intuition with the minimum number of inequalities, prove the key theorems, then close with a least-squares experiment that pits GD, Heavy Ball, and Nesterov against each other. The goal is not to stack formulas — it is to make you able to look at a new problem and instantly answer “what step size, what rate, is acceleration worth it?”

Anyone who has tuned a deep network knows the loop: bump the learning rate up a bit, the loss explodes; drop it, training crawls. Underneath that frustration there are really only three quantities in play:

How steep is the gradient? This is Lipschitz smoothness $$L$$ — it caps how big a step you are allowed to take.
How sharp is the bottom? This is the strong-convexity constant $\mu$ — it decides whether you converge linearly or just trudge along.
Can we go faster without losing stability? Nesterov acceleration is the answer, and it cuts the per-condition-number cost from $\kappa$ down to $\sqrt{\kappa}$ .

What I want to do here is concrete: build a picture-level intuition for $$L$$ and $\mu$ first, then open up the magic-looking $\beta_t = (t-1)/(t+2)$ in Nesterov’s method so you see it isn’t pulled out of a hat. Once you finish, your reflex on a new problem should be: estimate $$L$$ and $\mu$ first, then decide whether acceleration is worth its complexity tax.

What You Will Learn#

The geometric meaning of Lipschitz continuity: every point sits inside a slope cone that contains the function graph.
The equivalent characterisation of $$L$$ -smoothness: the function is sandwiched from above by a family of quadratics (descent lemma).
Strong convexity as a quadratic lower bound, with the existence and uniqueness of the minimizer as a free byproduct.
Why the condition number $\kappa = L/\mu$ controls GD’s iteration complexity, and how Nesterov replaces it with $\sqrt{\kappa}$ .
Why adaptive restart is still needed under strong convexity, demonstrated on least squares.

Prerequisites#

Multivariate calculus (gradients, Hessians, Taylor expansion).
Basic convex analysis (convex set, first-order condition).

Lipschitz continuity and gradient smoothness#

The geometric picture of Lipschitz continuity#

|f(y) - f(x)| \le L\,\|y - x\|.

Geometrically, this is a two-sided cone constraint: at any anchor $$(x_0, f(x_0))$$ , the double cone of slope $\pm L$ must enclose the entire function graph. As soon as the function develops a near-vertical tangent (e.g. $\sqrt{|x|}$ near the origin), no finite $$L$$ can contain it and $$f$$ stops being Lipschitz.

Lipschitz cone contains the graph; functions with vertical tangents cannot fit inside any finite-L cone

Left: $\sin x$ is 1-Lipschitz; the orange cones at three sample anchors fully contain the graph. Right: $\mathrm{sign}(x)\sqrt{|x|}$ has a divergent slope at the origin; the red region is where the candidate $$L=2$$ cone is pierced.

Two corollaries that will keep coming back:

Uniform continuity: take $\delta = \varepsilon / L$ .
Closure: Lipschitz functions are closed under addition, scalar multiplication, and composition (constants multiply). This lets us decompose complicated models into Lipschitz building blocks with known constants.

Gradient-Lipschitz = $$L$$ -smoothness#

\|\nabla f(y) - \nabla f(x)\| \le L\,\|y - x\|.

\boxed{\,f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \frac{L}{2}\,\|y - x\|^2.\,}

That is: $$f$$ is sandwiched from above by a family of upward parabolas with curvature $$L$$ . The tangent line is the worst-case lower envelope; the parabola is the actual upper envelope.

An L-smooth function is bounded above by a family of quadratic upper models

The blue curve is $f(x) = \tfrac{1}{2}\sin(2x) + \tfrac{1}{2}x^2$ . At each of the three anchor points the dashed quadratic $f(x_0) + f'(x_0)(x-x_0) + \tfrac{L}{2}(x-x_0)^2$ stays above $$f$$ globally, while the dotted tangent line only stays below where $$f$$ is locally convex.

f(y) \le f(x) - \eta\Big(1 - \frac{L\eta}{2}\Big)\|\nabla f(x)\|^2.

As long as $\eta \le 1/L$ , the bracket is $\ge 1/2$ , so every step strictly decreases $$f$$ , with the decrease lower-bounded by a constant times $\|\nabla f\|^2$ . This is the real origin of “step size at most $$1/L$$ ” — not folklore, but a direct corollary of the descent lemma.

GD on a 1D quadratic at three step sizes: contracting at eta=0.8/L, exact at eta=1/L, divergent at eta=2.2/L

Three regimes on the same quadratic $f(x) = \tfrac{L}{2}x^2$ (here $$L = 4$$ ). Left: a small step ( $\eta = 0.8/L$ ) creeps down monotonically. Middle: the canonical $\eta = 1/L$ lands at the optimum in one move from $$x_0 = 1$$ — fastest possible without overshoot. Right: cross the $$2/L$$ ceiling and the iterates blow up, the descent lemma broken. The tight band $[1/L,\, 2/L]$ is exactly where convergence is fastest and still safe; below it you waste iterations, above it you diverge.

Three worked examples#

Function	Gradient	Spectral norm of Hessian	$$L$$
$\tfrac{1}{2}\mid x\mid^2$	$$x$$	$$1$$	$$1$$
$\tfrac{1}{2}\mid Ax-b\mid^2$	$A^\top(Ax-b)$	$\lambda_{\max}(A^\top A)$	$\lambda_{\max}(A^\top A)$
Logistic $\log(1 + e^{-y\,\theta^\top x})$ (one sample)	$-y\,\sigma(-y\theta^\top x)\,x$	$\sigma(\cdot)\sigma(-\cdot)\,xx^\top$	$\tfrac{1}{4}\mid x\mid^2$

The third row gives the standard $L = \tfrac{1}{4n}\sum_i\|x_i\|^2$ for logistic regression — the bound $\sigma(\cdot)\sigma(-\cdot)\le 1/4$ does the work.

Hessian spectral bound implies gradient Lipschitz#

Theorem 1 (Hessian spectral norm $\Rightarrow$ Lipschitz gradient). If $$f$$ is twice differentiable with $\sup_x \|\nabla^2 f(x)\|_2 \le L$ , then $\nabla f$ is $$L$$ -Lipschitz.

\nabla f(y) - \nabla f(x) = \int_0^1 \nabla^2 f(x + t(y-x))\,(y-x)\,\mathrm dt.

\|\nabla f(y) - \nabla f(x)\| \le \int_0^1 \|\nabla^2 f(\cdot)\|_2\,\|y-x\|\,\mathrm dt \le L\,\|y-x\|.\quad\blacksquare

For a quadratic $f(x) = \tfrac{1}{2}x^\top H x$ this gives $L = \lambda_{\max}(H)$ exactly — the same $$L$$ that the descent lemma uses, so the step rule $\eta \le 1/\lambda_{\max}(H)$ is tight.

Strong convexity: existence, uniqueness, quadratic growth#

Three equivalent definitions#

f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle + \frac{\mu}{2}\,\|y - x\|^2.

It has three equivalent forms, each handier in a different context:

Quadratic lower bound (the inequality above): $$f$$ sits above a family of upward parabolas with curvature $\mu$ .
$f - \tfrac{\mu}{2}\|x\|^2$ is convex: subtract off the “ $\mu$ -curvature mass” and you are left with a convex function.
Second-order condition (if $$f$$ is $$C^2$$ ): $\nabla^2 f(x) \succeq \mu I$ .

A strongly convex function is bounded BELOW by a family of mu-quadratics; convex but not strongly convex (mu = 0) for contrast

Left: a strongly convex function with three quadratic lower models (one per anchor). Right: $f(x) = 0.05\,x^4$ is convex but not strongly convex — it is too flat at the origin for any $\mu>0$ parabola to fit underneath. This is exactly why we cannot get linear convergence on this kind of objective.

Existence and uniqueness#

Theorem 2 (existence). If $$f$$ is lower semi-continuous and some sublevel set $\{x : f(x)\le\alpha\}$ is non-empty and bounded, $$f$$ attains its minimum on that set.

This is just Weierstrass: closed + bounded = compact, and lsc functions attain their infimum on compact sets.

Theorem 3 (uniqueness from strong convexity). A $\mu$ -strongly convex function ( $\mu > 0$ ) on $\mathbb{R}^n$ has at most one global minimizer.

f(y^\star) \ge f(x^\star) + 0 + \frac{\mu}{2}\|y^\star - x^\star\|^2.

But $f(y^\star) = f(x^\star)$ , so $\|y^\star - x^\star\| = 0$ . $\blacksquare$

f(y) - f^\star \ge \frac{\mu}{2}\|y - x^\star\|^2,

i.e. the cost grows at least quadratically away from the minimizer. This is the lever that turns “small function value” into “small distance to optimum” in every convergence proof below.

$$L$$ -smooth and $\mu$ -strongly convex together: the condition number#

\mu I \preceq \nabla^2 f(x) \preceq L I,

\boxed{\,\kappa := \frac{L}{\mu} \ge 1\,}

is the condition number, and it controls every rate that follows. Large $\kappa$ means the function is “long and narrow” — the steepest direction is on the verge of instability while the flattest direction barely contributes any signal. That is the structural source of optimization difficulty.

Accelerated gradient descent: from $\kappa$ to $\sqrt{\kappa}$ #

Two upper bounds for plain GD#

f(x_t) - f^\star \le \frac{L\,\|x_0 - x^\star\|^2}{2t} = \mathcal O(1/t).

\|x_t - x^\star\|^2 \le \Big(1 - \frac{1}{\kappa}\Big)^t \|x_0 - x^\star\|^2.

Reaching error $\varepsilon$ takes $t = \mathcal O(\kappa\log(1/\varepsilon))$ . The condition number enters linearly: on a least-squares problem with $\kappa = 10^4$ , every additional factor of 10 in precision costs about another $2\times 10^4$ iterations.

Nesterov: lookahead trades $\kappa$ for $\sqrt{\kappa}$ #

x_{t+1} = x_t - \alpha\nabla f(x_t) + \beta(x_t - x_{t-1}).

\begin{aligned} y_t &= x_t + \beta_t (x_t - x_{t-1}), \\ x_{t+1} &= y_t - \eta\,\nabla f(y_t). \end{aligned}

Intuition: take a momentum-extrapolated peek at where you are about to land, then correct using the gradient there. That bit of foresight is enough to preserve acceleration on every $$L$$ -smooth convex function.

f(x_t) - f^\star \le \frac{2L\,\|x_0 - x^\star\|^2}{(t+1)^2}.

\beta = \frac{1 - \sqrt{1/\kappa}}{1 + \sqrt{1/\kappa}},

we get $f(x_t) - f^\star \le \big(1 - \sqrt{1/\kappa}\big)^t (f(x_0) - f^\star)$ , hence $t = \mathcal O(\sqrt{\kappa}\log(1/\varepsilon))$ .

Trajectories on a rotated ill-conditioned quadratic: GD zigzags across the steep direction while Nesterov rolls smoothly along the valley

Left: 80 iterations on a rotated quadratic with $\kappa = 30$ . GD (blue, $\eta = 1.9/L$ to amplify the effect) overshoots in the steep direction and zigzags across the narrow valley, making slow net progress along the flat axis. Nesterov (purple, $\eta = 1/L$ ) builds up momentum along the valley and barely excites the steep mode at all. Right: distance to $x^\star$ on a log scale — the slope difference is the geometric realisation of the $\kappa$ vs $\sqrt{\kappa}$ rate gap that Theorems 5 and 7 predict.

GD vs Heavy Ball vs Nesterov on a strongly convex quadratic

Left ( $\kappa=100$ ): on a log axis all three algorithms decrease linearly, but Nesterov (purple) and Heavy Ball (orange) have a steeper slope than GD (blue) — that is the $\sqrt{\kappa}$ vs $\kappa$ gap made visible. The dashed lines are the theoretical envelopes; they match the empirical traces. Right: zoom in on the first 80 iterations and Nesterov is non-monotone — it oscillates inside the lower envelope. That is the price of acceleration.

Adaptive restart fixes the side effects of acceleration#

Acceleration’s downside is non-monotonicity — the function value bumps up periodically. This becomes painful in two situations:

$\mu$ unknown: Theorem 7 needs $\sqrt{1/\kappa}$ in the momentum; misestimating it kills the rate.
Locally strongly convex: $$f$$ is only globally convex but strongly convex near $x^\star$ ; the convex-rate Nesterov underperforms.

Adaptive restart (O’Donoghue & Candès, 2015): whenever the gradient direction reverses ( $\langle\nabla f(y_t), x_{t+1}-x_t\rangle > 0$ ) or the function value rises, reset the momentum and the iteration counter $t \leftarrow 1$ .

Theorem 8 (restart Nesterov hits the optimal rate without knowing $\mu$ ). For $\mu$ -strongly convex + $$L$$ -smooth $$f$$ , restart Nesterov still achieves $\mathcal O(\sqrt{\kappa}\log(1/\varepsilon))$ iterations, without needing $\mu$ .

Proof sketch: between two restarts, you run convex-rate Nesterov, so the gap drops as $\mathcal O(1/k^2)$ . Combined with the quadratic-growth corollary, this gives “each restart at least halves the gap.” A geometric halving $\log(1/\varepsilon)$ times reaches $\varepsilon$ . Each segment has length $\sim \sqrt{\kappa}$ , so total iterations are $\sim \sqrt{\kappa}\log(1/\varepsilon)$ .

How much does $\kappa$ actually matter#

Condition number kappa controls iterations: GD grows linearly, Nesterov as sqrt(kappa)

Left (log-log): on synthetic least squares, the iterations to reach $10^{-6}$ relative gap grow strictly along $\kappa$ for GD and along $\sqrt{\kappa}$ for Nesterov. Right: the speedup ratio $T_{\text{GD}}/T_{\text{AGD}}$ tracks $\sqrt{\kappa}$ almost exactly. At $\kappa = 10^4$ the speedup is ~100x; at $\kappa = 100$ it is ~10x. The worse the conditioning, the more acceleration pays off.

A decision table#

Situation	Recommendation	Why
Small $\kappa$ ( $\le 50$ ), keep code simple	GD	Acceleration’s complexity payoff is small
Large $\kappa$ , $\mu$ known	Constant-momentum Nesterov (Thm 7)	Closed-form optimal $\beta$
Large $\kappa$ , $\mu$ unknown / locally s.c.	Adaptive-restart Nesterov	Adapts to $\mu$ while keeping the optimal rate
Strictly convex quadratic (least squares)	Conjugate Gradient first	Finite termination in $\le n$ steps, beats Nesterov
Non-convex but locally near-convex (deep nets)	Momentum / Adam with warmup + cosine	Theoretical rates no longer provable; engineering rules dominate

The least-squares experiment#

Problem and exact constants#

\min_{x\in\mathbb{R}^n} f(x) = \frac{1}{2}\|Ax - b\|^2, \qquad A\in\mathbb{R}^{m\times n},\; b\in\mathbb{R}^m.

\nabla f(x) = A^\top(Ax - b), \qquad \nabla^2 f(x) = A^\top A,

L = \lambda_{\max}(A^\top A), \qquad \mu = \lambda_{\min}(A^\top A), \qquad \kappa = \kappa(A^\top A) = \kappa(A)^2.

Critical caveat. The condition number is the square of $$A$$ ’s condition number. So an $$A$$ that “looks fine” with $\kappa(A) = 100$ becomes a hard $\kappa = 10^4$ instance for least squares — and for the normal-equation gradient, no less.

Implementation#

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
import numpy as np

def gd(A, b, n_iter, eta=None):
    """Gradient descent with step 1/L."""
    L = np.linalg.eigvalsh(A.T @ A).max()
    eta = eta or 1.0 / L
    x = np.zeros(A.shape[1])
    hist = []
    for _ in range(n_iter):
        g = A.T @ (A @ x - b)
        hist.append(np.linalg.norm(g))
        x -= eta * g
    return x, hist

def nesterov_strongconvex(A, b, n_iter):
    """Nesterov AGD using known mu, L (constant-momentum form)."""
    eigs = np.linalg.eigvalsh(A.T @ A)
    L, mu = eigs.max(), max(eigs.min(), 1e-12)
    eta = 1.0 / L
    sk = np.sqrt(mu / L)
    beta = (1 - sk) / (1 + sk)
    x_prev = x = np.zeros(A.shape[1])
    hist = []
    for _ in range(n_iter):
        y = x + beta * (x - x_prev)
        g = A.T @ (A @ y - b)
        hist.append(np.linalg.norm(A.T @ (A @ x - b)))
        x_prev, x = x, y - eta * g
    return x, hist

def nesterov_restart(A, b, n_iter):
    """Adaptive-restart Nesterov: reset momentum on function-value uptick."""
    L = np.linalg.eigvalsh(A.T @ A).max()
    eta = 1.0 / L
    x_prev = x = np.zeros(A.shape[1])
    t = 1
    f_prev = 0.5 * np.linalg.norm(A @ x - b) ** 2
    hist = []
    for _ in range(n_iter):
        beta = (t - 1) / (t + 2)
        y = x + beta * (x - x_prev)
        g = A.T @ (A @ y - b)
        x_new = y - eta * g
        f_new = 0.5 * np.linalg.norm(A @ x_new - b) ** 2
        if f_new > f_prev:               # restart trigger
            t = 1
            x_new = x - eta * (A.T @ (A @ x - b))
            f_new = 0.5 * np.linalg.norm(A @ x_new - b) ** 2
        else:
            t += 1
        hist.append(np.linalg.norm(A.T @ (A @ x_new - b)))
        x_prev, x, f_prev = x, x_new, f_new
    return x, hist

What we observe#

On a synthetic instance with $$m = 200, n = 100$$ , $\kappa(A) \approx 100$ (so $\kappa(A^\top A) \approx 10^4$ ):

GD: needs ~ $4\times 10^4$ iterations to reach $10^{-6}$ relative gradient norm; smooth but slow.
Nesterov (constant momentum): ~400 iterations; the trace oscillates inside the trough as Figure 4 (right) predicts.
Restart Nesterov: ~500 iterations and almost perfectly monotone — the most robust of the three.

The empirical speedup matches the $\sqrt{\kappa}$ prediction from Figure 5 ( $10^4 \to 100\times$ ).

Summary and where to go next#

The cheat sheet#

Assumption	Algorithm	Rate	Step size
$$L$$ -smooth, convex	GD	$\mathcal O(1/t)$	$\eta = 1/L$
$$L$$ -smooth, $\mu$ -strongly convex	GD	$\big(1 - 1/\kappa\big)^t$	$\eta = 1/L$
$$L$$ -smooth, convex	Nesterov	$\mathcal O(1/t^2)$	$\eta = 1/L,\ \beta_t = (t-1)/(t+2)$
$$L$$ -smooth, $\mu$ -strongly convex	Nesterov	$\big(1 - 1/\sqrt{\kappa}\big)^t$	$\eta = 1/L,\ \beta = (1-\sqrt{1/\kappa})/(1+\sqrt{1/\kappa})$
Same, $\mu$ unknown	Restart Nesterov	$\big(1 - 1/\sqrt{\kappa}\big)^t$	adaptive

Three reflexes for any new problem#

When a fresh objective lands on your desk, walk this loop:

How steep? Estimate $$L$$ (largest eigenvalue or backtracking line search). It immediately gives you the step ceiling.
How sharp at the bottom? Estimate $\mu$ (smallest eigenvalue or a PL constant). It tells you whether linear convergence is on the table.
Is acceleration worth it? Look at $\kappa = L/\mu$ . $\kappa < 50$ : GD is fine. $\kappa \in [50, 10^4]$ : switch to Nesterov. $\kappa > 10^4$ : think about preconditioning or a second-order method.

Where the story continues#

Non-convex + PL condition. Drop strong convexity; if $\tfrac{1}{2}\|\nabla f\|^2 \ge \mu(f - f^\star)$ holds, GD still converges linearly. This is the theoretical seed of “why over-parameterised deep nets train linearly” (Karimi et al., 2016).
Acceleration with noise. Vanilla Nesterov is not robust to stochastic gradients. SAG / SVRG / Katyusha pull the strongly-convex stochastic rate back to the $\sqrt{\kappa}$ regime through variance reduction.
Second-order acceleration. Sophia and Shampoo precondition by a (block-)diagonal Hessian, effectively rewriting the condition number — an active 2024 frontier in large-scale pretraining.

References#

Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate $\mathcal O(1/k^2)$ . Soviet Mathematics Doklady, 27(2), 372–376.
Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
Bubeck, S. (2015). Convex Optimization: Algorithms and Complexity. Foundations and Trends in Machine Learning, 8(3-4), 231–357.
O’Donoghue, B., & Candès, E. (2015). Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3), 715–732.
Karimi, H., Nutini, J., & Schmidt, M. (2016). Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. ECML-PKDD.

Optimization (2): Smoothness, Strong Convexity, and Nesterov Acceleration

What You Will Learn#

Prerequisites#

Lipschitz continuity and gradient smoothness#