Optimization (11): Non-Convex Optimization and Saddle Escape

Thu, 29 Sep 2022 09:00:00 +0000

For non-convex $f$, gradient descent has no global guarantee. For smooth $f$ that is bounded below, the best we can say in general is that $\nabla f(x_t) \to 0$: the iterates approach a stationary point, which could be a local minimum, a saddle, or even a local maximum. This article asks: when can we say more?
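To make the stationary-point caveat concrete, here is a minimal sketch in which plain gradient descent drives the gradient to zero yet stops at a saddle. The test function $f(x, y) = \tfrac{1}{2}x^2 + \tfrac{1}{4}y^4 - \tfrac{1}{2}y^2$, the starting point, and the step size are illustrative assumptions, not taken from the article.

```python
import numpy as np

# Hypothetical test function: saddle at (0, 0), local minima at (0, +1) and (0, -1).
def f(p):
    x, y = p
    return 0.5 * x**2 + 0.25 * y**4 - 0.5 * y**2

def grad(p):
    x, y = p
    return np.array([x, y**3 - y])

def hessian(p):
    x, y = p
    return np.array([[1.0, 0.0], [0.0, 3.0 * y**2 - 1.0]])

def gradient_descent(p0, lr=0.1, steps=500):
    """Plain GD with a fixed step size (no noise, no momentum)."""
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        p = p - lr * grad(p)
    return p

# Start exactly on the line y = 0, which GD never leaves for this f.
p_end = gradient_descent([1.0, 0.0])
grad_norm = np.linalg.norm(grad(p_end))
min_eig = np.linalg.eigvalsh(hessian(p_end)).min()

print("final point:", p_end)
print("gradient norm:", grad_norm)
print("smallest Hessian eigenvalue:", min_eig)
```

The gradient norm is essentially zero, but the smallest Hessian eigenvalue is negative, so the limit is a saddle rather than a local minimum. The start point was chosen on purpose: a generic initialization would not sit exactly on the line $y = 0$.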

Three positive results:

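Saddle escape: under a “strict saddle” assumption (every saddle has a direction of strictly negative curvature), perturbed GD reaches an approximate local minimum in polynomial time. Such saddles are unstable, so a small random perturbation (or even numerical noise) pushes the iterate off them.
PL condition: the Polyak-Łojasiewicz inequality $\tfrac{1}{2}\|\nabla f(x)\|^2 \ge \mu\,(f(x) - f^*)$ is a relaxation of strong convexity that can hold for over-parameterized neural networks. Under PL and smoothness, vanilla GD converges linearly, reaching $f(x_t) - f^* \le \epsilon$ in $O(\log(1/\epsilon))$ iterations, even without convexity.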
Loss landscape facts: for sufficiently wide neural networks, all local minima are global, and SGD’s noise gives an implicit bias toward flat minima, which tend to generalize better.
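The saddle-escape idea from the first result can be sketched in a few lines. This is a simplified illustration, not the exact algorithm analyzed in the literature: the test function, the gradient threshold `g_thresh`, the kick `radius`, and the rule "kick whenever the gradient is small" are all assumptions made for the example.

```python
import numpy as np

# Hypothetical test function f(x, y) = x^2/2 + y^4/4 - y^2/2:
# strict saddle at (0, 0), local minima at (0, +1) and (0, -1).
def grad(p):
    x, y = p
    return np.array([x, y**3 - y])

def perturbed_gd(p0, lr=0.1, steps=2000, g_thresh=1e-3, radius=1e-2, seed=0):
    """GD that adds a small random kick whenever the gradient is tiny."""
    rng = np.random.default_rng(seed)
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        g = grad(p)
        if np.linalg.norm(g) < g_thresh:
            # Near-stationary point: sample a kick uniformly from a disc
            # of the given radius (random direction, sqrt-scaled length).
            direction = rng.standard_normal(2)
            direction /= np.linalg.norm(direction)
            p = p + radius * np.sqrt(rng.uniform()) * direction
        else:
            p = p - lr * g
    return p

# radius=0.0 disables the kick, so this run behaves like plain GD that stalls.
p_plain = perturbed_gd([1.0, 0.0], radius=0.0)
p_kicked = perturbed_gd([1.0, 0.0])

print("without perturbation:", p_plain)
print("with perturbation:   ", p_kicked)
```

Started on the line $y = 0$, the unperturbed run stays near the saddle at the origin, while the perturbed run ends close to one of the minima at $(0, \pm 1)$. Because this sketch keeps kicking near a minimum as well, the final iterate only hovers within roughly `radius` of it; a real implementation would add a stopping rule.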

Each is rigorous in its setting. The article also discusses what is not known: there is no general theorem saying “SGD finds the global optimum of a deep network.”

Deep Learning Theory on Chen Kai Blog
