
Optimization (3): The Gradient Descent Family from SGD to AdamW
One article that traces the full lineage GD -> SGD -> Momentum -> NAG -> AdaGrad -> RMSProp -> Adam -> AdamW, then onwards to Lion / Sophia / Schedule-Free. Each step is framed by the specific failure of the previous one, and we end with a practical selection guide.
Why is “tuning the LR is an art” a meme for ResNet, while every modern LLM paper just writes “AdamW, $\beta_1{=}0.9, \beta_2{=}0.95, \mathrm{wd}{=}0.1$ ” and moves on? It is not an accident — it is the end-point of three decades of optimizer evolution.
This post walks the lineage end-to-end on a single thread: each step exists because of a specific failure of the previous one. We end with the three directions that have actually entered the post-2023 large-model toolkit: Lion, Sophia, and Schedule-Free.
Why is “learning-rate tuning is an art” a meme around ResNet, while practically every modern LLM paper just shrugs and writes “AdamW with $\beta_1{=}0.9, \beta_2{=}0.95, \mathrm{wd}{=}0.1$ ” before moving on? It isn’t an accident. It is the endpoint of thirty-plus years of optimizer evolution.
I walked this lineage once myself, and what made it click was always the same thing: every new optimizer arrived because the previous one tripped over a specific failure mode in a real training run. This article walks the chain end to end, from plain gradient descent all the way to the post-2023 contenders that actually showed up in big-model toolkits — Lion, Sophia, Schedule-Free.
You don’t need to memorize Adam’s update rule to read this. You only need to be willing to keep asking, at each step, “what was wrong with the previous version?”
What You Will Learn#
- Why GD zig-zags on ill-conditioned losses, and how momentum fixes it physically
- The exact mathematical difference between Nesterov “lookahead” and classical momentum
- Why AdaGrad is a killer on sparse features, and why it eventually “suffocates” in deep nets
- How RMSProp rescued AdaGrad with a one-line change (exponential moving average)
- How Adam stitches momentum and RMSProp together, and why bias correction matters
- AdamW vs Adam: why “L2 == weight decay” stops being true once you put adaptive scaling in the denominator
- Lion / Sophia / Schedule-Free: the three post-AdamW directions that scaled
Prerequisites#
- Basic calculus (gradients, Hessian, Taylor expansion)
- Some experience training a neural network (any framework)
The lineage at a glance#
| Year | Algorithm | Specific problem it fixed |
|---|---|---|
| 1847 | GD | Formalized “step along the negative gradient” |
| 1951 | SGD | Datasets too big for full-batch gradients |
| 1964 | Momentum | GD zig-zags in narrow valleys |
| 1983 | NAG | Plain momentum overshoots near minima |
| 2011 | AdaGrad | Sparse features need per-coordinate LRs |
| 2012 | RMSProp | AdaGrad’s denominator suffocates the LR |
| 2014 | Adam | Combine direction (momentum) and scale (RMSProp) |
| 2017 | AdamW | Adam + L2 != Adam + weight decay |
| 2023 | Lion | Drop the second moment; use sign of momentum |
| 2023 | Sophia | Cheap diagonal-Hessian preconditioner |
| 2024 | Schedule-Free | Stop needing to know the total step count |
The sections below follow this order.
Gradient descent (GD): the origin#
$$\theta_{t+1} = \theta_t - \eta\,\nabla J(\theta_t).$$Convergence: if $J$ is convex with $L$ -Lipschitz gradient, $\eta \le 1/L$ guarantees (sub)linear convergence to the global minimum.
The fatal weakness that motivates everything else:
- Under ill-conditioned curvature (Hessian condition number $\kappa = \lambda_{\max}/\lambda_{\min}$ large), the iteration count grows linearly in $\kappa$ . The 1-D intuition: $f(\theta)=\frac{1}{2}H\theta^2$ gives $\theta_{t+1}=(1-\eta H)\theta_t$ , stable iff $\eta < 2/H$ . Your step is capped by the curvature in the steepest direction.
- When the steepest direction ($\lambda_{\max}$ ) and the flattest direction ($\lambda_{\min}$ ) differ by orders of magnitude, you barely move along the flat one but bounce back and forth along the steep one. That is the narrow-valley problem — visible in the left panel of Fig 1 below.
SGD: the price and bonus of noise#
$$g_t = \nabla J(\theta_t) + \xi_t,\qquad \mathbb{E}[\xi_t]=0.$$The noise $\xi_t$ is both a curse and a blessing:
- Curse: a slightly larger step gets amplified by noise into divergence.
- Blessing: noise helps escape sharp local minima — later linked by Keskar et al. to the “flat-minima” generalization story.
Fig 1 middle panel: SGD’s trajectory in the same valley is hairier than GD’s, but on average it still flows toward the bottom.
Momentum: give the optimizer some inertia#
$$v_t = \gamma v_{t-1} + \eta\,g_t,\qquad \theta_{t+1} = \theta_t - v_t.$$Typical $\gamma = 0.9$ — geometrically weights past gradients with effective memory $\approx 1/(1-\gamma) = 10$ steps.
Key insight: momentum amplifies the effective step size by roughly $1/(1-\gamma)$ . So when you turn momentum on, you must shrink the LR you used without it. This is the most common beginner trap.
Fig 1 right panel: the same valley, but momentum’s path is “straightened” — perpendicular oscillation cancels, longitudinal velocity accumulates.

Nesterov accelerated gradient (NAG): peek before you leap#
Classical momentum overshoots near the minimum: it computes the gradient at the current point, so it only learns “oops, I went too far” one step too late.
$$v_t = \gamma v_{t-1} + \eta\,\nabla J(\theta_t - \gamma v_{t-1}),\qquad \theta_{t+1} = \theta_t - v_t.$$The only difference is where you evaluate the gradient: classical momentum at $\theta_t$ , NAG at the lookahead point $\theta_t - \gamma v_{t-1}$ — i.e. “where the momentum step alone would have taken me”.
Why it works: it is a one-step look-ahead. If the slope is about to flatten, NAG sees that early and decelerates; the converse for steepening. Nesterov (1983) proved this accelerates convex smooth optimization from $O(1/t)$ to $O(1/t^2)$ .

AdaGrad: every coordinate gets its own learning rate#
By 2011, NLP was drowning in sparse features — think word2vec where a rare word might appear 5 times in a million examples. With a single $\eta$ for everything:
- Rare-word parameters: small gradients, but the same $\eta$ is either too big (kills them) or too small (they never learn).
- Frequent-word parameters: big and frequent gradients, would prefer smaller steps.
Intuition: large accumulated $g^2$ -> large denominator -> small effective step. Rare-but-suddenly-large coordinate -> small denominator -> large effective step. LR is auto-distributed by frequency.
The fatal flaw: $G_t$ is a monotonically growing sum. Train deep nets for hundreds of thousands of steps and the denominator drives every effective LR toward zero. The model “suffocates”. The right panel of Fig 3 makes this concrete.

RMSProp: replace cumulative sum with EMA#
$$E[g^2]_t = \rho\,E[g^2]_{t-1} + (1-\rho)\,g_t^2$$ $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t}+\epsilon}\,g_t.$$Typical $\rho = 0.9$ — “remember roughly the last 10 steps of gradient magnitude”.
The crucial difference:
- AdaGrad: $G_t$ only ever grows -> LR only ever shrinks (irreversible).
- RMSProp: $E[g^2]_t$ is a finite-window average -> when gradient magnitude changes, the denominator follows -> the LR can scale back up.
Fig 4 right panel shows this directly: at step 60 the gradient magnitude drops sharply. AdaGrad’s effective LR keeps falling; RMSProp’s effective LR climbs back to match the new regime.

Adam: stitch momentum and RMSProp together#
By now both threads were mature:
- Momentum gives a good direction.
- RMSProp gives a good per-coordinate scale.
Defaults: $\beta_1 = 0.9,\ \beta_2 = 0.999,\ \epsilon = 10^{-8}$ .
Why $\beta_2$ is much larger than $\beta_1$ : variance estimates are noisier than mean estimates and need a longer averaging window. $1/(1-0.999) = 1000$ steps — and that is exactly why Adam typically needs ~1000 warmup steps before $v_t$ “warms up”.

AdamW: the weight-decay bug that lived for a decade#
Adding L2 regularization $\frac{\lambda}{2}\|\theta\|^2$ to the loss adds a term $\lambda\theta$ to the gradient. In SGD this is exactly equivalent to multiplying weights by $(1-\eta\lambda)$ each step — the classical “weight decay”.
But Loshchilov & Hutter (2017) noticed that in Adam these two operations are no longer equivalent. The reason is direct: Adam divides the gradient by $\sqrt{\hat v_t}$ . If you fold $\lambda\theta$ into the gradient, it also gets divided by $\sqrt{\hat v_t}$ — meaning parameters with large gradient history get less weight decay, which is the opposite of what regularization wants.
$$\theta_{t+1} = \theta_t - \eta\,\frac{\hat m_t}{\sqrt{\hat v_t}+\epsilon} - \eta\lambda\,\theta_t.$$The effect: at the same $\lambda$ and LR, AdamW’s generalization gap on ImageNet/Transformer is meaningfully smaller than Adam+L2. This is why post-2018 every large-model pretrain defaults to AdamW.

The post-2023 frontier: three directions that scaled#
After AdamW reigned for ~6 years, three directions have actually proven themselves at scale since 2023.
Lion (Google, 2023): only the sign#
$$m_t = \beta_2 m_{t-1} + (1-\beta_2)\,g_t$$ $$\theta_{t+1} = \theta_t - \eta\,\mathrm{sign}\bigl(\beta_1 m_{t-1} + (1-\beta_1)\,g_t\bigr).$$Notable properties:
- Half the optimizer state: no $v_t$ needed — meaningful real money for hundred-billion-parameter models.
- Constant update magnitude $\eta$ : because sign returns $\pm 1$ . So Lion’s LR must be about 10x smaller than AdamW’s, and wd about 10x larger.
- On ViT and LLM pretraining, matches or slightly beats AdamW with faster wall-clock.
Sophia (Stanford, 2023): cheap second-order#
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t$$ $$h_t \approx \mathrm{diag}(H_t) \quad\text{(Hutchinson estimate, every } k \text{ steps)}$$ $$\theta_{t+1} = \theta_t - \eta\,\mathrm{clip}\!\left(\frac{m_t}{\max(\gamma h_t,\,\varepsilon)},\,1\right).$$Core tricks:
- Use $\mathrm{diag}(H)$ instead of $g^2$ as the denominator — that is the actual curvature.
- The
clipis essential: $h_t$ can be negative in non-convex losses, and clipping keeps the update bounded. - The Hessian probe runs only every $k$ steps, so amortized cost is modest.
Reported results: roughly halves the wall-clock to reach a given perplexity at GPT-2 scale.
Schedule-Free (Meta, 2024): drop the schedule#
LR schedules (cosine, WSD, etc.) all share one annoyance: you must know the total step count in advance. During research you usually do not, so committing to a schedule ties your hands.
$$y_t = (1-\beta) z_t + \beta x_t \quad\text{(point at which the gradient is taken)}$$ $$z_{t+1} = z_t - \eta\,\nabla J(y_t)$$ $$x_{t+1} = (1-c_t)\,x_t + c_t\,z_{t+1} \quad\text{(returned "averaged" parameters)}$$The result: matches the final performance of cosine schedules without any explicit decay, and can be extended mid-training without redesigning anything.

Selection guide#
| Setting | Recommendation | Why |
|---|---|---|
| Convex / simple regression | GD or SGD-momentum | Strong theory, easy tuning |
| CV baseline | SGD + Nesterov + cosine | Historically best on ResNet/CNN |
| Transformer / LLM pretraining | AdamW + warmup + cosine/WSD | Industry default; close to free lunch |
| Memory-constrained large models | Lion | Saves the 1st-moment-equivalent state (~1/3 memory) |
| Research, unknown training length | Schedule-Free AdamW | Extend mid-run, no schedule redesign |
| Chasing wall-clock SOTA | Sophia | 2nd-order acceleration, but engineering cost |
Five facts that get missed most often#
- If you turn momentum on, lower the LR. Momentum amplifies the effective step by roughly $1/(1-\gamma)$ . With $\gamma=0.9$ that is ~10x.
- Adam’s $\beta_2 = 0.999$ implies a ~1000-step warmup because $v_t$ has not “warmed up” before that.
- AdamW’s wd is decoupled from the LR. When the LR scheduler decays the LR, wd does NOT decay with it. This is the fundamental difference from the old SGD+L2 workflow.
- Lion’s LR must be ~10x smaller than AdamW’s. Copy-pasting AdamW’s
3e-4will diverge immediately. - Second-order methods looked “permanently impractical” not because they are bad, but because Hessians used to be too expensive. Sophia broke that wall by combining $\mathrm{diag}(H)$ with cheap Hutchinson estimation.
Optimizer state memory: the cost the math hides#
The clean derivation of momentum, AdaGrad, RMSProp, and Adam never mentions VRAM. In production it is the dominant constraint. For a model with $P$ trainable parameters in fp16:
| Optimizer | Per-param state (fp32) | For 7 B params | For 70 B params |
|---|---|---|---|
| SGD | 0 | 0 | 0 |
| SGD + momentum | 4 bytes | 28 GB | 280 GB |
| Adam / AdamW | 8 bytes ($m, v$ ) | 56 GB | 560 GB |
| Lion | 4 bytes ($m$ only) | 28 GB | 280 GB |
| Sophia | 8 bytes ($m, h$ ) | 56 GB | 560 GB |
| Adafactor | ~2 bytes (factored) | 14 GB | 140 GB |
8-bit AdamW (bnb) | 2 bytes | 14 GB | 140 GB |
This is the table that decides which optimizer you can actually run. A 7 B model in fp16 takes 14 GB for weights and 14 GB for gradients; AdamW adds another 56 GB on top, blowing past a single A100 80 GB. The same model with 8-bit AdamW or Adafactor fits comfortably. This is why the LLM-pretraining literature has an obsession with optimizer-state quantization that classical ML never had: the algorithm is fine, the memory is the bottleneck.
A practical rule of thumb: if optimizer state exceeds 1.5× model weight memory, you are likely going to want to either shard the optimizer (ZeRO-1), quantize it (bitsandbytes), or move to a state-lighter algorithm (Lion, Adafactor). The choice depends on whether your bottleneck is single-GPU memory or aggregate cluster memory.

Mixed precision: where the optimizer sees fp32#
A subtle point that bites first-time pretrainers: the optimizer almost always operates in fp32 even when the rest of training is fp16/bf16. The reason is that Adam’s exponential moving averages accumulate over thousands of steps. With $\beta_2 = 0.999$ , $v_t$ is a sum where the smallest contributors are $10^{-3}$ of the largest, easily below fp16’s representable range ($\sim 6 \times 10^{-5}$ to $\sim 6 \times 10^4$ ).
The standard recipe is:
- Forward + backward in fp16/bf16. Gradients are fp16/bf16.
- Cast gradients to fp32 before the optimizer step.
- Optimizer state ($m, v$ , weights) lives in fp32.
- After the update, cast weights back to fp16/bf16 for the next forward pass.
This is what PyTorch’s torch.amp and DeepSpeed do automatically. It also explains why the “memory cost” rows above are in fp32 bytes — even if your model is fp16, the optimizer is paying fp32 per param.
bf16 changes the calculus slightly: its dynamic range is wide enough that you can sometimes keep gradients in bf16 throughout, but the optimizer state is still fp32 in every implementation worth using. For LLM pretraining at scale this is non-negotiable.

Learning rate sanity ranges per optimizer (Transformer baseline)#
These are the intervals I reach for first when starting a Transformer-class run. Treat them as Bayesian priors, not finals.
| Optimizer | LR range | Notes |
|---|---|---|
| SGD + momentum | 0.01 – 0.5 | Linear warmup ~5 % of total steps |
| AdamW | 1e-4 – 6e-4 | Most LLMs land at 3e-4 ± 1.5x |
| Lion | 1e-5 – 6e-5 | Roughly 10× smaller than AdamW |
| Sophia | 1e-4 – 4e-4 | Also lower than AdamW |
| Schedule-Free AdamW | 1e-4 – 6e-4 | Same as AdamW; the schedule is what changes |
| Adafactor | rel. step 0.01 – 0.05 | Uses relative step; literal LR is meaningless |
These are first-shot priors. The real number depends on batch size (linear scaling rule for SGD, $\sqrt{}$ -ish scaling rule for Adam), warmup length, and dataset noise. But starting outside these intervals is almost always a configuration mistake, not a discovery.
Summary#
Three decades of optimizer evolution compress to two sentences:
- GD to Adam: first solve the direction problem (momentum), then the scale problem (AdaGrad / RMSProp), then merge them and fix the bias (Adam).
- Adam onwards: algorithmic improvement gives way to regularization detail (AdamW), memory efficiency (Lion), second-order information (Sophia), and schedule freedom (Schedule-Free).
If you only remember one thing: the LLM-era default is still AdamW + warmup + cosine/WSD + gradient clipping. Until you have a concrete bottleneck (memory, wall-clock, schedule flexibility), every paper claiming to beat AdamW deserves a baseline reproduction on your own task before you commit.
References#
- Adam: A Method for Stochastic Optimization (Kingma & Ba, 2014) — arXiv:1412.6980
- Decoupled Weight Decay Regularization (Loshchilov & Hutter, 2017) — arXiv:1711.05101
- Symbolic Discovery of Optimization Algorithms / Lion (Chen et al., 2023) — arXiv:2302.06675
- Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training (Liu et al., 2023) — arXiv:2305.14342
- The Road Less Scheduled / Schedule-Free (Defazio et al., 2024) — arXiv:2405.15682
- For the learning-rate side of training: Learning Rate: From Basics to Large-Scale Training
Optimization Theory 12 parts
- 01 Optimization (1): Convex Analysis Foundations
- 02 Optimization (2): Smoothness, Strong Convexity, and Nesterov Acceleration
- 03 Optimization (3): The Gradient Descent Family from SGD to AdamW you are here
- 04 Optimization (4): Learning Rate and Schedules
- 05 Optimization (5): Acceleration Beyond Nesterov
- 06 Optimization (6): Composite Optimization and Proximal Methods
- 07 Optimization (7): Second-Order Methods
- 08 Optimization (8): Lagrangian Duality and KKT Conditions
- 09 Optimization (9): Interior-Point Methods and Self-Concordant Barriers
- 10 Optimization (10): Stochastic Optimization and Variance Reduction
- 11 Optimization (11): Non-Convex Optimization and Saddle Escape
- 12 Optimization (12): Discrete and Global Optimization