<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Self-Distillation on Chen Kai Blog</title><link>https://www.chenk.top/en/tags/self-distillation/</link><description>Recent content in Self-Distillation on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Sun, 25 May 2025 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/tags/self-distillation/index.xml" rel="self" type="application/rss+xml"/><item><title>Transfer Learning (5): Knowledge Distillation</title><link>https://www.chenk.top/en/transfer-learning/05-knowledge-distillation/</link><pubDate>Sun, 25 May 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/05-knowledge-distillation/</guid><description>&lt;p>You have a 340M-parameter BERT model that hits 95% accuracy. The product team wants it on a phone that can barely fit 10M parameters. Training a 10M model from scratch lands at 85%. Knowledge distillation closes most of the gap: train the small model on the &lt;em>output distribution&lt;/em> of the large one, not just on the labels, and you can reach 92%.&lt;/p>
&lt;p>The key insight, due to Hinton et al., is that a teacher&amp;rsquo;s &amp;ldquo;wrong&amp;rdquo; predictions are not noise; they are information. When the teacher classifies a cat image and assigns 0.14 to &amp;ldquo;tiger&amp;rdquo;, 0.07 to &amp;ldquo;dog&amp;rdquo;, and 0.008 to &amp;ldquo;plane&amp;rdquo;, it is telling you that cats look a lot like tigers, somewhat like dogs, and nothing like aeroplanes. That structure, called &lt;strong>dark knowledge&lt;/strong>, is invisible in a one-hot label, and learning it is what lets the student punch above its weight.&lt;/p></description></item></channel></rss>