<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Deep Learning Theory on Chen Kai Blog</title><link>https://www.chenk.top/en/tags/deep-learning-theory/</link><description>Recent content in Deep Learning Theory on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Thu, 29 Sep 2022 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/tags/deep-learning-theory/index.xml" rel="self" type="application/rss+xml"/><item><title>Optimization (11): Non-Convex Optimization and Saddle Escape</title><link>https://www.chenk.top/en/optimization-theory/11-nonconvex-saddle-escape/</link><pubDate>Thu, 29 Sep 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/optimization-theory/11-nonconvex-saddle-escape/</guid><description>&lt;p>For non-convex &lt;span class="math-inline">$f$&lt;/span>
, gradient descent has no global guarantee. The best we can say is that &lt;span class="math-inline">$\nabla f(x_t) \to 0$&lt;/span>
, so any limit point of the iterates is stationary: it could be a local minimum, a saddle point, or even a local maximum. This article asks: when can we say more?&lt;/p>
&lt;p>Three positive results:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Saddle escape&lt;/strong>: under a &amp;ldquo;strict saddle&amp;rdquo; assumption, perturbed GD reaches an approximate local minimum in polynomial time. Strict saddle points are unstable: a small random perturbation (or even numerical round-off) pushes the iterate onto a direction of negative curvature, and GD then moves away.&lt;/li>
&lt;li>&lt;strong>PL condition&lt;/strong>: the Polyak-Łojasiewicz (PL) inequality is a relaxation of strong convexity that can hold for over-parameterized neural networks. Under PL, vanilla GD converges linearly, reaching accuracy &lt;span class="math-inline">$\epsilon$&lt;/span>
 in &lt;span class="math-inline">$O(\log(1/\epsilon))$&lt;/span>
 iterations, even without convexity.&lt;/li>
&lt;li>&lt;strong>Loss landscape facts&lt;/strong>: for sufficiently wide neural networks, all local minima are global, and SGD&amp;rsquo;s noise induces an implicit bias toward flat minima, which tend to generalize better.&lt;/li>
&lt;/ol>
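&lt;p>To make the PL result concrete, here is a sketch of the standard one-step argument. It rests on assumptions the summary above does not spell out: &lt;span class="math-inline">$f$&lt;/span>
 is &lt;span class="math-inline">$L$&lt;/span>
-smooth with minimum value &lt;span class="math-inline">$f^*$&lt;/span>
, the step size is &lt;span class="math-inline">$1/L$&lt;/span>
, and the PL constant is written &lt;span class="math-inline">$\mu > 0$&lt;/span>
 (the symbols &lt;span class="math-inline">$L$&lt;/span>
, &lt;span class="math-inline">$\mu$&lt;/span>
 and &lt;span class="math-inline">$f^*$&lt;/span>
 are introduced here for illustration). The PL inequality reads &lt;span class="math-inline">$\tfrac{1}{2}\|\nabla f(x)\|^2 \ge \mu\,(f(x) - f^*)$&lt;/span>
, and smoothness gives the descent step &lt;span class="math-inline">$f(x_{t+1}) \le f(x_t) - \tfrac{1}{2L}\|\nabla f(x_t)\|^2$&lt;/span>
. Combining the two yields &lt;span class="math-inline">$f(x_{t+1}) - f^* \le (1 - \mu/L)\,(f(x_t) - f^*)$&lt;/span>
, hence &lt;span class="math-inline">$f(x_t) - f^* \le (1 - \mu/L)^t\,(f(x_0) - f^*)$&lt;/span>
. Because &lt;span class="math-inline">$1 - \mu/L \le e^{-\mu/L}$&lt;/span>
, at most &lt;span class="math-inline">$(L/\mu)\log\big((f(x_0) - f^*)/\epsilon\big)$&lt;/span>
 iterations suffice to bring the gap below &lt;span class="math-inline">$\epsilon$&lt;/span>
. Convexity is never used in this argument.&lt;/p>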
&lt;p>Each is rigorous in its setting. The article also discusses what is &lt;strong>not&lt;/strong> known — there is no general theorem saying &amp;ldquo;SGD finds the global optimum of a deep network.&amp;rdquo;&lt;/p></description></item></channel></rss>