Optimization Theory

From convex analysis to non-convex landscapes — first-order, second-order, constrained, stochastic, and combinatorial optimization with complete proofs.

12 articles

  1. 01

    Optimization (1): Convex Analysis Foundations

    The geometric and analytic toolkit that unlocks the rest of the series: convex sets, convex functions, the conjugate …

    26 min
  2. 02

    Optimization (2): Smoothness, Strong Convexity, and Nesterov Acceleration

    Three concepts that demystify most of optimization: Lipschitz smoothness fixes the maximum step size, strong convexity …

    28 min
  3. 03

    Optimization (3): The Gradient Descent Family from SGD to AdamW

    One article that traces the full lineage GD -> SGD -> Momentum -> NAG -> AdaGrad -> RMSProp -> Adam -> AdamW, then …

    24 min
  4. 04

    Optimization (4): Learning Rate and Schedules

    A practitioner's guide to the single most important hyperparameter: why too-large LR explodes, how warmup and schedules …

    40 min
  5. 05

    Optimization (5): Acceleration Beyond Nesterov

    What does it really mean for a first-order method to be optimal? We prove a tight lower bound matching Nesterov's rate, …

    26 min
  6. 06

    Optimization (6): Composite Optimization and Proximal Methods

    A systematic walk through the proximal operator: convex-analysis basics, the Moreau envelope, closed-form proxes, and …

    32 min
  7. 07

    Optimization (7): Second-Order Methods

    Second-order methods break the sqrt(kappa) barrier by using curvature. We prove Newton's quadratic local convergence, …

    24 min
  8. 08

    Optimization (8): Lagrangian Duality and KKT Conditions

    How constraints become prices: the Lagrangian, weak duality, Slater's condition for strong duality, the KKT system as …

    20 min
  9. 09

    Optimization (9): Interior-Point Methods and Self-Concordant Barriers

    How interior-point methods became the default solver for convex programming: replace inequalities with a logarithmic …

    22 min
  10. 10

    Optimization (10): Stochastic Optimization and Variance Reduction

    Why does SGD work? We prove the O(1/sqrt(T)) convex rate and the O(1/(mu T)) strongly convex rate from the gradient …

    20 min
  11. 11

    Optimization (11): Non-Convex Optimization and Saddle Escape

    Why does SGD work for training neural networks despite the non-convex landscape? We prove perturbed GD escapes strict …

    22 min
  12. 12

    Optimization (12): Discrete and Global Optimization

    When variables are integer-valued or the problem is non-convex with multiple basins, classical convex methods stop …

    38 min