Tags: Deep Learning Theory
Optimization (11): Non-Convex Optimization and Saddle Escape
Why does SGD work for training neural networks despite the non-convex loss landscape? We prove that perturbed gradient descent escapes strict saddle points in polynomial time, derive convergence under the Polyak-Łojasiewicz condition, and survey what …
