Series · Optimization Theory · Chapter 3

Optimization (3): The Gradient Descent Family from SGD to AdamW

One article that traces the full lineage GD -> SGD -> Momentum -> NAG -> AdaGrad -> RMSProp -> Adam -> AdamW, then onwards to Lion / Sophia / Schedule-Free. Each step is framed by the specific failure of the previous one, and we end with a practical selection guide.

Why is “learning-rate tuning is an art” a meme around ResNet, while practically every modern LLM paper just shrugs and writes “AdamW with $\beta_1{=}0.9, \beta_2{=}0.95, \mathrm{wd}{=}0.1$ ” before moving on? It isn’t an accident. It is the endpoint of thirty-plus years of optimizer evolution.

I walked this lineage once myself, and what made it click was always the same thing: every new optimizer arrived because the previous one tripped over a specific failure mode in a real training run. This article walks the chain end to end, from plain gradient descent all the way to the post-2023 contenders that actually showed up in big-model toolkits — Lion, Sophia, Schedule-Free.

You don’t need to memorize Adam’s update rule to read this. You only need to be willing to keep asking, at each step, “what was wrong with the previous version?”

What You Will Learn

  • Why GD zig-zags on ill-conditioned losses, and how momentum fixes it physically
  • The exact mathematical difference between Nesterov “lookahead” and classical momentum
  • Why AdaGrad is a killer on sparse features, and why it eventually “suffocates” in deep nets
  • How RMSProp rescued AdaGrad with a one-line change (exponential moving average)
  • How Adam stitches momentum and RMSProp together, and why bias correction matters
  • AdamW vs Adam: why “L2 == weight decay” stops being true once you put adaptive scaling in the denominator
  • Lion / Sophia / Schedule-Free: the three post-AdamW directions that scaled

Prerequisites

  • Basic calculus (gradients, Hessian, Taylor expansion)
  • Some experience training a neural network (any framework)

The lineage at a glance

| Year | Algorithm | Specific problem it fixed |
| --- | --- | --- |
| 1847 | GD | Formalized “step along the negative gradient” |
| 1951 | SGD | Datasets too big for full-batch gradients |
| 1964 | Momentum | GD zig-zags in narrow valleys |
| 1983 | NAG | Plain momentum overshoots near minima |
| 2011 | AdaGrad | Sparse features need per-coordinate LRs |
| 2012 | RMSProp | AdaGrad’s denominator suffocates the LR |
| 2014 | Adam | Combine direction (momentum) and scale (RMSProp) |
| 2017 | AdamW | Adam + L2 != Adam + weight decay |
| 2023 | Lion | Drop the second moment; use sign of momentum |
| 2023 | Sophia | Cheap diagonal-Hessian preconditioner |
| 2024 | Schedule-Free | Stop needing to know the total step count |

The sections below follow this order.

Gradient descent (GD): the origin

$$\theta_{t+1} = \theta_t - \eta\,\nabla J(\theta_t).$$

Convergence: if $J$ is convex with $L$-Lipschitz gradient, $\eta \le 1/L$ guarantees convergence to the global minimum at a sublinear $O(1/t)$ rate; if $J$ is also strongly convex, the rate becomes linear (geometric).

The fatal weakness that motivates everything else:

  • Under ill-conditioned curvature (Hessian condition number $\kappa = \lambda_{\max}/\lambda_{\min}$ large), the iteration count grows linearly in $\kappa$ . The 1-D intuition: $f(\theta)=\frac{1}{2}H\theta^2$ gives $\theta_{t+1}=(1-\eta H)\theta_t$ , stable iff $\eta < 2/H$ . Your step is capped by the curvature in the steepest direction.
  • When the steepest direction ($\lambda_{\max}$ ) and the flattest direction ($\lambda_{\min}$ ) differ by orders of magnitude, you barely move along the flat one but bounce back and forth along the steep one. That is the narrow-valley problem — visible in the left panel of Fig 1 below.
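The dependence on $\kappa$ can be checked numerically. Below is a minimal sketch, assuming a diagonal quadratic $f(x)=\frac{1}{2}\sum_i \lambda_i x_i^2$ whose Hessian eigenvalues are passed in directly; the function name `gd_steps_to_converge` and the tolerances are illustrative choices, not part of the original article.

```python
import numpy as np

def gd_steps_to_converge(eigs, lr, tol=1e-6, max_steps=1_000_000):
    """Run GD on f(x) = 0.5 * sum(eigs * x**2); return steps until ||x|| < tol.

    Returns None if the iterates blow up or max_steps is exhausted.
    """
    eigs = np.asarray(eigs, dtype=float)
    x = np.ones_like(eigs)
    for step in range(1, max_steps + 1):
        x = x - lr * eigs * x              # gradient of the quadratic is eigs * x
        if not np.all(np.isfinite(x)) or np.linalg.norm(x) > 1e12:
            return None                    # diverged
        if np.linalg.norm(x) < tol:
            return step
    return None

# Same steepest curvature (lambda_max = 1), increasingly flat second direction.
steps_k10 = gd_steps_to_converge([1.0, 0.1], lr=1.0)    # kappa = 10,  lr = 1 / lambda_max
steps_k100 = gd_steps_to_converge([1.0, 0.01], lr=1.0)  # kappa = 100, lr = 1 / lambda_max
diverged = gd_steps_to_converge([1.0, 0.1], lr=2.1)     # lr above 2 / lambda_max

print(steps_k10, steps_k100, diverged)
```

Multiplying $\kappa$ by 10 multiplies the step count by roughly 10, and pushing the LR past $2/\lambda_{\max}$ makes the steep coordinate diverge no matter how flat the other one is.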

SGD: the price and bonus of noise

$$g_t = \nabla J(\theta_t) + \xi_t,\qquad \mathbb{E}[\xi_t]=0.$$

The noise $\xi_t$ is both a curse and a blessing:

  • Curse: a slightly larger step gets amplified by noise into divergence.
  • Blessing: noise helps escape sharp local minima — later linked by Keskar et al. to the “flat-minima” generalization story.

Fig 1 middle panel: SGD’s trajectory in the same valley is hairier than GD’s, but on average it still flows toward the bottom.
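The "curse" can be made quantitative on the same 1-D quadratic used earlier. The following is a sketch under the added assumption that the noise $\xi_t$ is independent across steps with fixed variance $\sigma^2$ (real minibatch noise depends on $\theta$, so treat this as intuition only). With $f(\theta)=\frac{1}{2}H\theta^2$ the SGD recursion and its variance are

$$\theta_{t+1} = (1-\eta H)\,\theta_t - \eta\,\xi_t, \qquad \mathrm{Var}[\theta_{t+1}] = (1-\eta H)^2\,\mathrm{Var}[\theta_t] + \eta^2\sigma^2.$$

Setting $\mathrm{Var}[\theta_{t+1}] = \mathrm{Var}[\theta_t]$ gives the stationary noise floor

$$\mathrm{Var}[\theta_\infty] = \frac{\eta\,\sigma^2}{H\,(2-\eta H)}, \qquad 0 < \eta < 2/H.$$

The floor grows with $\eta$ and blows up as $\eta \to 2/H$, which is why a step size that is merely "a bit large" for GD can be unusable for SGD.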

Momentum: give the optimizer some inertia

$$v_t = \gamma v_{t-1} + \eta\,g_t,\qquad \theta_{t+1} = \theta_t - v_t.$$

Typical $\gamma = 0.9$ — geometrically weights past gradients with effective memory $\approx 1/(1-\gamma) = 10$ steps.

Key insight: momentum amplifies the effective step size by roughly $1/(1-\gamma)$ . So when you turn momentum on, you must shrink the LR you used without it. This is the most common beginner trap.
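The $1/(1-\gamma)$ factor is easy to verify: feed the velocity recursion a constant gradient and watch where it settles. This is a minimal sketch; `momentum_velocity` is a hypothetical helper name, and the constant-gradient setting is an idealization of a long, straight slope.

```python
def momentum_velocity(grad, lr=0.01, gamma=0.9, steps=200):
    """Iterate v_t = gamma * v_{t-1} + lr * grad for a constant gradient."""
    v = 0.0
    for _ in range(steps):
        v = gamma * v + lr * grad
    return v

plain_step = 0.01 * 1.0                       # per-step move of plain GD at the same lr
terminal_step = momentum_velocity(grad=1.0)   # approaches lr * grad / (1 - gamma)
amplification = terminal_step / plain_step

print(f"amplification ~ {amplification:.4f}")
```

With $\gamma=0.9$ the per-step move settles at ten times the plain-GD step, so an LR tuned without momentum is effectively ten times too large once momentum is switched on.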

Fig 1 right panel: the same valley, but momentum’s path is “straightened” — perpendicular oscillation cancels, longitudinal velocity accumulates.

Fig 1. GD / SGD / Momentum trajectories on an ill-conditioned quadratic

Nesterov accelerated gradient (NAG): peek before you leap

Classical momentum overshoots near the minimum: it computes the gradient at the current point, so it only learns “oops, I went too far” one step too late.

$$v_t = \gamma v_{t-1} + \eta\,\nabla J(\theta_t - \gamma v_{t-1}),\qquad \theta_{t+1} = \theta_t - v_t.$$

The only difference is where you evaluate the gradient: classical momentum at $\theta_t$ , NAG at the lookahead point $\theta_t - \gamma v_{t-1}$ — i.e. “where the momentum step alone would have taken me”.

Why it works: it is a one-step look-ahead. If the slope is about to flatten, NAG sees that early and decelerates; the converse for steepening. Nesterov (1983) proved this accelerates convex smooth optimization from $O(1/t)$ to $O(1/t^2)$ .
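To see the one-line difference in action, here is a small sketch that runs both update rules on the 1-D quadratic $f(\theta)=\frac{1}{2}H\theta^2$ and measures how far each one overshoots past the minimum at 0. The function `run` and its default hyperparameters are illustrative assumptions.

```python
def run(method, hessian=1.0, lr=0.1, gamma=0.9, steps=100, theta0=1.0):
    """Classical momentum or NAG on f(theta) = 0.5 * hessian * theta**2."""
    theta, v = theta0, 0.0
    path = [theta]
    for _ in range(steps):
        # The only difference: where the gradient is evaluated.
        point = theta - gamma * v if method == "nag" else theta
        grad = hessian * point
        v = gamma * v + lr * grad
        theta = theta - v
        path.append(theta)
    return path

classical = run("classical")
nag = run("nag")

# Start at theta = 1, minimum at 0: the most negative value is the overshoot.
overshoot_classical = -min(classical)
overshoot_nag = -min(nag)
print(f"overshoot: classical={overshoot_classical:.3f}, nag={overshoot_nag:.3f}")
```

On this toy problem NAG overshoots noticeably less and its oscillation dies out faster, which is the "decelerate early" behaviour described above.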

Fig 2. NAG: lookahead gradient evaluation reduces overshoot

AdaGrad: every coordinate gets its own learning rate

By 2011, NLP was drowning in sparse features: think bag-of-words features or, slightly later, word2vec-style embeddings, where a rare word might appear 5 times in a million examples. With a single $\eta$ for everything:

  • Rare-word parameters: gradients arrive only a handful of times, so a single $\eta$ is either too big (one large update wrecks them) or too small (they never learn).
  • Frequent-word parameters: big and frequent gradients, would prefer smaller steps.

$$G_t = G_{t-1} + g_t^2 \quad(\text{element-wise})$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t}+\epsilon}\,g_t.$$

Intuition: large accumulated $g^2$ -> large denominator -> small effective step. Rare-but-suddenly-large coordinate -> small denominator -> large effective step. LR is auto-distributed by frequency.

The fatal flaw: $G_t$ is a monotonically growing sum. Train deep nets for hundreds of thousands of steps and the denominator drives every effective LR toward zero. The model “suffocates”. The right panel of Fig 3 makes this concrete.
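Both the benefit and the flaw show up in a two-coordinate toy. This is a sketch with synthetic gradients (one coordinate updated every step, one updated once per 1000 steps); the helper name `adagrad_effective_lr` and the numbers are assumptions chosen for illustration.

```python
import numpy as np

def adagrad_effective_lr(grads, lr=0.1, eps=1e-8):
    """Per-step effective LR lr / (sqrt(G_t) + eps) for each coordinate."""
    grads = np.asarray(grads, dtype=float)   # shape (steps, coords)
    G = np.cumsum(grads ** 2, axis=0)        # running sum of squared gradients
    return lr / (np.sqrt(G) + eps)

steps = 10_000
grads = np.zeros((steps, 2))
grads[:, 0] = 1.0          # "frequent word": a gradient on every step
grads[::1000, 1] = 1.0     # "rare word": a gradient on 1 step in 1000

eff = adagrad_effective_lr(grads)
print("final effective LR (frequent, rare):", eff[-1])
```

The rare coordinate ends with an effective LR about 30 times larger than the frequent one, which is the desired redistribution. The frequent coordinate's effective LR, however, falls like $\eta/\sqrt{t}$ and never recovers.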

Fig 3. AdaGrad: shrinks the steep direction automatically, but every per-coord LR decays monotonically

RMSProp: replace cumulative sum with EMA

$$E[g^2]_t = \rho\,E[g^2]_{t-1} + (1-\rho)\,g_t^2$$ $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t}+\epsilon}\,g_t.$$

Typical $\rho = 0.9$ — “remember roughly the last 10 steps of gradient magnitude”.

The crucial difference:

  • AdaGrad: $G_t$ only ever grows -> LR only ever shrinks (irreversible).
  • RMSProp: $E[g^2]_t$ is a finite-window average -> when gradient magnitude changes, the denominator follows -> the LR can scale back up.

Fig 4 right panel shows this directly: at step 60 the gradient magnitude drops sharply. AdaGrad’s effective LR keeps falling; RMSProp’s effective LR climbs back to match the new regime.
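The same experiment can be reproduced in a few lines. This is a sketch assuming a synthetic gradient magnitude that drops from 1.0 to 0.1 at step 60; `effective_lrs` is a hypothetical helper, not code from the figure.

```python
import numpy as np

def effective_lrs(grad_mags, lr=0.01, rho=0.9, eps=1e-8):
    """Track lr / (sqrt(accumulator) + eps) for AdaGrad and RMSProp."""
    G, E = 0.0, 0.0
    adagrad, rmsprop = [], []
    for g in grad_mags:
        G += g ** 2                          # AdaGrad: cumulative sum
        E = rho * E + (1 - rho) * g ** 2     # RMSProp: exponential moving average
        adagrad.append(lr / (np.sqrt(G) + eps))
        rmsprop.append(lr / (np.sqrt(E) + eps))
    return np.array(adagrad), np.array(rmsprop)

# Gradient magnitude 1.0 for 60 steps, then 0.1 for 60 steps.
grad_mags = np.concatenate([np.full(60, 1.0), np.full(60, 0.1)])
adagrad_lr, rmsprop_lr = effective_lrs(grad_mags)

print("AdaGrad  before/after:", adagrad_lr[59], adagrad_lr[-1])
print("RMSProp  before/after:", rmsprop_lr[59], rmsprop_lr[-1])
```

AdaGrad's effective LR keeps shrinking after the regime change, while RMSProp's rises by close to the 10x that the smaller gradients call for.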

Fig 4. RMSProp (EMA) vs AdaGrad (cumulative) under non-stationary gradient magnitude

Adam: stitch momentum and RMSProp together

By now both threads were mature:

  • Momentum gives a good direction.
  • RMSProp gives a good per-coordinate scale.

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t \quad\text{(1st moment = momentum)}$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2 \quad\text{(2nd moment = RMSProp)}$$
$$\hat m_t = \frac{m_t}{1-\beta_1^t},\qquad \hat v_t = \frac{v_t}{1-\beta_2^t}$$
$$\theta_{t+1} = \theta_t - \frac{\eta\,\hat m_t}{\sqrt{\hat v_t}+\epsilon}.$$

Defaults: $\beta_1 = 0.9,\ \beta_2 = 0.999,\ \epsilon = 10^{-8}$ .

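Why $\beta_2$ is much larger than $\beta_1$: variance estimates are noisier than mean estimates and need a longer averaging window. $1/(1-0.999) = 1000$ steps, which is one heuristic reason Adam runs commonly use a warmup on the order of 1000 steps while $v_t$ “warms up”.

Why bias correction matters: $m_0 = v_0 = 0$, so the early averages are biased toward zero. At $t=1$, $m_1 = (1-\beta_1)\,g_1$ is only a tenth of the gradient; dividing by $1-\beta_1^t$ recovers $\hat m_1 = g_1$, and the same holds for $\hat v_t$.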
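The four equations translate almost line for line into code. The following is a minimal NumPy sketch of a single Adam step, written for illustration rather than taken from any library; the function name `adam_step` and its argument order are assumptions.

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    m = b1 * m + (1 - b1) * g            # first moment (momentum branch)
    v = b2 * v + (1 - b2) * g ** 2       # second moment (RMSProp branch)
    m_hat = m / (1 - b1 ** t)            # bias correction for zero init
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta0 = np.zeros(2)
g = np.array([100.0, 0.001])             # gradients five orders of magnitude apart
theta1, m1, v1 = adam_step(theta0, g, np.zeros(2), np.zeros(2), t=1)
first_step = theta0 - theta1

print("raw m after one step:", m1)       # only 10% of g: biased toward zero
print("first update:", first_step)       # about lr in both coordinates
```

After bias correction the very first update has magnitude close to $\eta$ in both coordinates, even though the raw gradients differ by a factor of $10^5$. Without the correction, $m_1$ would be ten times too small and $\sqrt{v_1}$ about thirty times too small, so the first steps would be mis-scaled.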

Fig 5. Adam dataflow: momentum branch + RMSProp branch -> bias correction -> adaptive update

AdamW: the weight-decay bug hiding in Adam

Adding L2 regularization $\frac{\lambda}{2}\|\theta\|^2$ to the loss adds a term $\lambda\theta$ to the gradient. In SGD this is exactly equivalent to multiplying weights by $(1-\eta\lambda)$ each step — the classical “weight decay”.
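The equivalence for SGD is a one-line rearrangement. Substituting the regularized gradient $\nabla J(\theta_t) + \lambda\theta_t$ into the plain update gives

$$\theta_{t+1} = \theta_t - \eta\bigl(\nabla J(\theta_t) + \lambda\theta_t\bigr) = (1-\eta\lambda)\,\theta_t - \eta\,\nabla J(\theta_t),$$

so the L2 term acts as a pure multiplicative shrink of the weights, independent of the loss gradient. The rearrangement relies on the gradient entering the update linearly and unscaled, which is exactly what Adam's per-coordinate division breaks.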

But Loshchilov & Hutter (2017) noticed that in Adam these two operations are no longer equivalent. The reason is direct: Adam divides the gradient by $\sqrt{\hat v_t}$ . If you fold $\lambda\theta$ into the gradient, it also gets divided by $\sqrt{\hat v_t}$ — meaning parameters with large gradient history get less weight decay, which is the opposite of what regularization wants.

$$\theta_{t+1} = \theta_t - \eta\,\frac{\hat m_t}{\sqrt{\hat v_t}+\epsilon} - \eta\lambda\,\theta_t.$$

The effect: Loshchilov & Hutter report that, at matched settings, decoupled decay generalizes better than Adam+L2 on their image-classification benchmarks and makes the best $\lambda$ less sensitive to the LR. That is a large part of why AdamW became the default for large-model pretraining.
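The asymmetry can be isolated by looking only at the part of the step caused by the decay term. This is a sketch under two stated assumptions: $\hat v_t$ is dominated by the loss gradient, and $\theta$ changes slowly enough that the EMA of $\lambda\theta$ is approximately $\lambda\theta$. The helper `decay_terms` is hypothetical.

```python
import numpy as np

def decay_terms(theta, v_hat, lr=1e-3, wd=0.1, eps=1e-8):
    """Per-coordinate shrinkage from the decay term alone (loss gradient ignored)."""
    coupled = lr * wd * theta / (np.sqrt(v_hat) + eps)   # Adam + L2: decay is rescaled
    decoupled = lr * wd * theta                          # AdamW: decay is not rescaled
    return coupled, decoupled

theta = np.array([1.0, 1.0])          # two weights of equal size
v_hat = np.array([100.0, 0.01])       # typical |g| of 10 vs 0.1
coupled, decoupled = decay_terms(theta, v_hat)

print("Adam+L2 decay per step:", coupled)
print("AdamW   decay per step:", decoupled)
```

Under Adam+L2 the weight with the larger gradient history is decayed 100 times more weakly than the other, although both weights have the same size. Under AdamW both are shrunk by the same $\eta\lambda\theta$.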

Fig 6. AdamW (decoupled) vs Adam+L2 (coupled): where weight decay enters the update

The post-2023 frontier: three directions that scaled

After AdamW reigned for ~6 years, three directions have actually proven themselves at scale since 2023.

Lion (Google, 2023): only the sign

$$m_t = \beta_2 m_{t-1} + (1-\beta_2)\,g_t$$ $$\theta_{t+1} = \theta_t - \eta\,\mathrm{sign}\bigl(\beta_1 m_{t-1} + (1-\beta_1)\,g_t\bigr).$$

Notable properties:

  • Half the optimizer state: no $v_t$ needed — meaningful real money for hundred-billion-parameter models.
  • Constant update magnitude $\eta$ : because sign returns $\pm 1$ . So Lion’s LR must be about 10x smaller than AdamW’s, and wd about 10x larger.
  • On ViT and LLM pretraining, matches or slightly beats AdamW with faster wall-clock.
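The two equations above fit in a few lines. Here is a minimal NumPy sketch of one Lion step without weight decay, mirroring the formulas in this section; the function name `lion_step` and the default hyperparameter values are assumptions for illustration.

```python
import numpy as np

def lion_step(theta, g, m, lr=1e-4, b1=0.9, b2=0.99):
    """One Lion update: sign of an interpolated momentum, then refresh the momentum."""
    update = np.sign(b1 * m + (1 - b1) * g)   # direction only, each entry in {-1, 0, +1}
    theta = theta - lr * update
    m = b2 * m + (1 - b2) * g                 # the only optimizer state that is kept
    return theta, m

theta = np.zeros(3)
g = np.array([5.0, -0.001, 20.0])             # wildly different gradient scales
new_theta, new_m = lion_step(theta, g, np.zeros(3))
step = theta - new_theta

print("per-coordinate step:", step)
```

Every coordinate moves by exactly $\eta$ regardless of gradient scale, which is why the LR has to be chosen much more conservatively than for AdamW, where the typical per-coordinate update is well below $\eta$.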

Sophia (Stanford, 2023): cheap second-order

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t$$ $$h_t \approx \mathrm{diag}(H_t) \quad\text{(Hutchinson estimate, every } k \text{ steps)}$$ $$\theta_{t+1} = \theta_t - \eta\,\mathrm{clip}\!\left(\frac{m_t}{\max(\gamma h_t,\,\varepsilon)},\,1\right).$$

Core tricks:

  • Use $\mathrm{diag}(H)$ instead of $g^2$ as the denominator — that is the actual curvature.
  • The clip is essential: $h_t$ can be negative in non-convex losses, and clipping keeps the update bounded.
  • The Hessian probe runs only every $k$ steps, so amortized cost is modest.

Reported results: roughly halves the wall-clock to reach a given perplexity at GPT-2 scale.
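The two ingredients, a Hutchinson probe and a clipped preconditioned step, can be sketched on a toy problem. The code below is an illustration under simplifying assumptions: the Hessian-vector product is supplied as a plain function, no moving average is applied to $h_t$, and the hyperparameter values are placeholders rather than recommended settings. The names `hutchinson_diag` and `sophia_step` are hypothetical.

```python
import numpy as np

def hutchinson_diag(hvp, dim, rng, samples=1):
    """Estimate diag(H) as the average of u * (H u) over Rademacher vectors u."""
    est = np.zeros(dim)
    for _ in range(samples):
        u = rng.choice([-1.0, 1.0], size=dim)
        est += u * hvp(u)                 # needs only a Hessian-vector product
    return est / samples

def sophia_step(theta, m, h, lr=1e-4, gamma=0.01, eps=1e-12):
    """Clipped update: theta - lr * clip(m / max(gamma * h, eps), 1)."""
    ratio = m / np.maximum(gamma * h, eps)
    return theta - lr * np.clip(ratio, -1.0, 1.0)

rng = np.random.default_rng(0)
H = np.diag([4.0, 0.5, -1.0])             # toy Hessian with one negative-curvature direction
h = hutchinson_diag(lambda u: H @ u, dim=3, rng=rng)

theta = np.zeros(3)
m = np.array([1.0, 1e-4, 1.0])
new_theta = sophia_step(theta, m, h)
print("h estimate:", h)
print("update:", new_theta - theta)
```

Because this toy Hessian is diagonal, a single probe recovers the diagonal exactly; for a general Hessian the estimate is only unbiased and needs averaging. The third coordinate has negative curvature, so the denominator falls back to `eps` and the clip caps the step at $\eta$, which is the safety role the clip plays above.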

Schedule-Free (Meta, 2024): drop the schedule

LR schedules (cosine, WSD, etc.) all share one annoyance: you must know the total step count in advance. During research you usually do not, so committing to a schedule ties your hands.

$$y_t = (1-\beta) z_t + \beta x_t \quad\text{(point at which the gradient is taken)}$$ $$z_{t+1} = z_t - \eta\,\nabla J(y_t)$$ $$x_{t+1} = (1-c_t)\,x_t + c_t\,z_{t+1} \quad\text{(returned "averaged" parameters)}$$

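Here $c_t$ is the averaging weight; a uniform running average corresponds to a weight of roughly $1/t$. The reported result: it matches the final performance of cosine schedules without any explicit decay, and can be extended mid-training without redesigning anything.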
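The three-sequence update is short enough to run by hand. This is a minimal sketch of the SGD form on a scalar problem, assuming a uniform averaging weight of $1/(t+1)$ at step $t$ and a constant LR; `schedule_free_sgd` is a hypothetical name, and practical versions add warmup and an AdamW-style inner step.

```python
def schedule_free_sgd(grad_fn, theta0, lr=0.5, beta=0.9, steps=2000):
    """Schedule-Free SGD sketch: z is the raw SGD iterate, x the running average."""
    z = x = theta0
    for t in range(1, steps + 1):
        y = (1 - beta) * z + beta * x     # point where the gradient is taken
        z = z - lr * grad_fn(y)           # plain SGD step with a constant LR
        c = 1.0 / (t + 1)                 # uniform averaging weight (assumed form)
        x = (1 - c) * x + c * z           # averaged parameters, the ones you evaluate
    return x

# f(theta) = 0.5 * theta**2, so the gradient is theta and the minimum is at 0.
x_final = schedule_free_sgd(lambda th: th, theta0=1.0)
print(f"x after 2000 steps: {x_final:.2e}")
```

No step count enters the update: `steps` only controls how long the loop runs, so training can be stopped or extended at any point and `x` is always the current best iterate.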

Fig 7. Lion / Sophia / Schedule-Free: the three post-AdamW directions

Selection guide

| Setting | Recommendation | Why |
| --- | --- | --- |
| Convex / simple regression | GD or SGD-momentum | Strong theory, easy tuning |
| CV baseline | SGD + Nesterov + cosine | Historically best on ResNet/CNN |
| Transformer / LLM pretraining | AdamW + warmup + cosine/WSD | Industry default; close to free lunch |
| Memory-constrained large models | Lion | Drops the second-moment state (half of AdamW's optimizer state) |
| Research, unknown training length | Schedule-Free AdamW | Extend mid-run, no schedule redesign |
| Chasing wall-clock SOTA | Sophia | 2nd-order acceleration, but engineering cost |

Five facts that get missed most often

  1. If you turn momentum on, lower the LR. Momentum amplifies the effective step by roughly $1/(1-\gamma)$ . With $\gamma=0.9$ that is ~10x.
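  2. Adam’s $\beta_2 = 0.999$ means $v_t$ averages over roughly 1000 steps, which is one heuristic reason warmups on the order of 1000 steps are common. Bias correction fixes the zero initialization, but the early estimate still rests on very few gradients.
  3. AdamW’s wd is decoupled from the adaptive scaling, not from the LR. In the AdamW update above the decay term is $\eta\lambda\theta_t$, so when the scheduler decays the LR, the per-step decay shrinks with it. What changes relative to Adam+L2 is that the decay is no longer divided by $\sqrt{\hat v_t}$.
  4. Lion’s LR must be ~10x smaller than AdamW’s. Copy-pasting AdamW’s 3e-4 is likely to diverge.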
  5. Second-order methods looked “permanently impractical” not because they are bad, but because Hessians used to be too expensive. Sophia broke that wall by combining $\mathrm{diag}(H)$ with cheap Hutchinson estimation.

Optimizer state memory: the cost the math hides

The clean derivation of momentum, AdaGrad, RMSProp, and Adam never mentions VRAM. In production it is the dominant constraint. For a model with $P$ trainable parameters in fp16:

| Optimizer | Per-param state (fp32) | For 7 B params | For 70 B params |
| --- | --- | --- | --- |
| SGD | 0 | 0 | 0 |
| SGD + momentum | 4 bytes | 28 GB | 280 GB |
| Adam / AdamW | 8 bytes ($m, v$) | 56 GB | 560 GB |
| Lion | 4 bytes ($m$ only) | 28 GB | 280 GB |
| Sophia | 8 bytes ($m, h$) | 56 GB | 560 GB |
| Adafactor | ~2 bytes (factored) | 14 GB | 140 GB |
| 8-bit AdamW (bnb) | 2 bytes | 14 GB | 140 GB |

This is the table that decides which optimizer you can actually run. A 7 B model in fp16 takes 14 GB for weights and 14 GB for gradients; AdamW adds another 56 GB on top, blowing past a single A100 80 GB. The same model with 8-bit AdamW or Adafactor fits comfortably. This is why the LLM-pretraining literature has an obsession with optimizer-state quantization that classical ML never had: the algorithm is fine, the memory is the bottleneck.

A practical rule of thumb: if optimizer state exceeds 1.5× model weight memory, you are likely going to want to either shard the optimizer (ZeRO-1), quantize it (bitsandbytes), or move to a state-lighter algorithm (Lion, Adafactor). The choice depends on whether your bottleneck is single-GPU memory or aggregate cluster memory.
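The arithmetic behind the table and the 1.5× rule is simple enough to script. Below is a small calculator sketch that uses the per-parameter byte counts from the table; the dictionary keys, the function name `training_memory_gb`, and the convention 1 GB = $10^9$ bytes are assumptions, and activations are deliberately ignored.

```python
# Optimizer-state bytes per parameter, copied from the table above.
STATE_BYTES_PER_PARAM = {
    "sgd": 0, "sgd_momentum": 4, "adamw": 8, "lion": 4,
    "sophia": 8, "adafactor": 2, "adamw_8bit": 2,
}

def training_memory_gb(n_params, optimizer, weight_bytes=2, grad_bytes=2):
    """Weights + gradients + optimizer state in GB (1 GB = 1e9 bytes), no activations."""
    weights = n_params * weight_bytes / 1e9
    grads = n_params * grad_bytes / 1e9
    state = n_params * STATE_BYTES_PER_PARAM[optimizer] / 1e9
    return {"weights": weights, "grads": grads, "optimizer_state": state,
            "total": weights + grads + state,
            "state_over_weights": state / weights}

for name in ("adamw", "lion", "adamw_8bit"):
    mem = training_memory_gb(7e9, name)
    print(f"{name:>11}: total {mem['total']:.0f} GB, "
          f"state/weights = {mem['state_over_weights']:.1f}x")
```

For 7 B parameters this gives 84 GB with AdamW (state is 4× the fp16 weights, well past the 1.5× threshold) versus 42 GB with 8-bit AdamW.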

Fig 8. Optimizer state memory across optimizers, and total VRAM at 7B / 70B scale

Mixed precision: where the optimizer sees fp32

A subtle point that bites first-time pretrainers: the optimizer almost always operates in fp32 even when the rest of training is fp16/bf16. The reason is that Adam’s exponential moving averages accumulate many tiny increments over thousands of steps. With $\beta_2 = 0.999$, each step adds only $10^{-3}\,g_t^2$ to $v_t$, and squared gradients are often tiny to begin with, so the increment easily falls below fp16’s normal range ($\sim 6 \times 10^{-5}$ to $\sim 6 \times 10^4$) or is rounded away when added to a much larger running value.

The standard recipe is:

  1. Forward + backward in fp16/bf16. Gradients are fp16/bf16.
  2. Cast gradients to fp32 before the optimizer step.
  3. Optimizer state ($m, v$ , weights) lives in fp32.
  4. After the update, cast weights back to fp16/bf16 for the next forward pass.

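DeepSpeed’s mixed-precision optimizers manage fp32 master weights in this way automatically. PyTorch’s torch.amp reaches the same end state by a different route: parameters stay in fp32 and autocast runs individual ops in fp16/bf16. It also explains why the “memory cost” rows above are in fp32 bytes: even if your model is fp16, the optimizer is paying fp32 per param.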
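Both failure modes of low-precision optimizer state can be reproduced with NumPy scalars. This is a sketch with illustrative values (a gradient of $10^{-3}$ and a running value of 1.0), not a measurement from a real training run.

```python
import numpy as np

beta2 = 0.999
g = 1e-3                                   # a small but ordinary gradient value
increment = (1 - beta2) * g ** 2           # about 1e-9: one step's contribution to v_t

# Failure 1: the increment itself underflows in fp16.
inc_fp16 = np.float16(increment)           # below fp16's smallest subnormal (~6e-8)
inc_fp32 = np.float32(increment)           # comfortably representable in fp32

# Failure 2: a representable increment is rounded away by a larger running value.
acc_fp16 = np.float16(1.0) + np.float16(1e-4)
acc_fp32 = np.float32(1.0) + np.float32(1e-4)

print("increment  fp16:", inc_fp16, " fp32:", inc_fp32)
print("1 + 1e-4   fp16:", acc_fp16, " fp32:", acc_fp32)
```

In fp16 the increment becomes exactly zero and the addition leaves the accumulator unchanged, so a moving average stored in fp16 would silently stop updating. fp32 keeps both.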

bf16 changes the calculus slightly: its dynamic range is wide enough that you can sometimes keep gradients in bf16 throughout, but the optimizer state is still fp32 in every implementation worth using. For LLM pretraining at scale this is non-negotiable.

Fig 9. Mixed-precision training data flow: where each tensor lives in fp16/bf16 vs fp32

Learning rate sanity ranges per optimizer (Transformer baseline)

These are the intervals I reach for first when starting a Transformer-class run. Treat them as priors, not final values.

| Optimizer | LR range | Notes |
| --- | --- | --- |
| SGD + momentum | 0.01 – 0.5 | Linear warmup ~5 % of total steps |
| AdamW | 1e-4 – 6e-4 | Most LLMs land at 3e-4 ± 1.5x |
| Lion | 1e-5 – 6e-5 | Roughly 10× smaller than AdamW |
| Sophia | 1e-4 – 4e-4 | Also lower than AdamW |
| Schedule-Free AdamW | 1e-4 – 6e-4 | Same as AdamW; the schedule is what changes |
| Adafactor | rel. step 0.01 – 0.05 | Uses relative step; literal LR is meaningless |

These are first-shot priors. The real number depends on batch size (linear scaling rule for SGD, roughly square-root scaling for Adam), warmup length, and dataset noise. But starting outside these intervals is almost always a configuration mistake, not a discovery.
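The two batch-size heuristics mentioned above can be written as a tiny helper. This is a sketch of the rules of thumb only; `scale_lr` is a hypothetical function, and the scaled value is a starting point for a sweep, not a guarantee.

```python
import math

def scale_lr(base_lr, base_batch, new_batch, rule):
    """Rescale a tuned LR for a new batch size using a rule-of-thumb scaling."""
    ratio = new_batch / base_batch
    if rule == "linear":        # common heuristic for SGD
        return base_lr * ratio
    if rule == "sqrt":          # common heuristic for Adam-family optimizers
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule!r}")

# Going from batch 256 to batch 1024 (4x larger):
sgd_lr = scale_lr(0.1, 256, 1024, rule="linear")    # 4x the LR
adamw_lr = scale_lr(3e-4, 256, 1024, rule="sqrt")   # 2x the LR
print(sgd_lr, adamw_lr)
```

If the scaled value lands outside the table's interval for that optimizer, that is usually a sign the batch size has left the regime where the heuristic holds.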

Summary

Three decades of optimizer evolution compress to two sentences:

  • GD to Adam: first solve the direction problem (momentum), then the scale problem (AdaGrad / RMSProp), then merge them and fix the bias (Adam).
  • Adam onwards: algorithmic improvement gives way to regularization detail (AdamW), memory efficiency (Lion), second-order information (Sophia), and schedule freedom (Schedule-Free).

If you only remember one thing: the LLM-era default is still AdamW + warmup + cosine/WSD + gradient clipping. Until you have a concrete bottleneck (memory, wall-clock, schedule flexibility), every paper claiming to beat AdamW deserves a baseline reproduction on your own task before you commit.
