Transfer Learning (5): Knowledge Distillation

Sun, 25 May 2025 09:00:00 +0000

You have a 340M-parameter BERT model that hits 95% accuracy. The product team wants it on a phone that can barely fit 10M parameters. Training a 10M model from scratch lands at 85%. Knowledge distillation closes most of the gap: train the small model on the output distribution of the large one, not just on the labels, and you can reach 92%.

The key insight, due to Hinton, is that a teacher’s “wrong” predictions are not noise — they are information. When the teacher classifies a cat image and assigns 0.14 to “tiger”, 0.07 to “dog”, and 0.008 to “plane”, it is telling you that cats look a lot like tigers, somewhat like dogs, and nothing like aeroplanes. That structure — dark knowledge — is invisible in a one-hot label, and learning it is what lets the student punch above its weight.
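To make the idea concrete, here is a minimal sketch in plain Python, not a reference implementation. It reuses the teacher probabilities from the cat example (the 0.782 for "cat" is simply the remainder needed for the distribution to sum to 1). The two student logit vectors, the `distillation_loss` helper, and the mixing weight `alpha` are all hypothetical, chosen only for illustration. Both students are equally confident about "cat", so a one-hot label cannot tell them apart, but the soft-target loss prefers the student whose runner-up classes are ordered the way the teacher orders them.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted, eps=1e-12):
    """H(target, predicted) = -sum_i target_i * log(predicted_i)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def distillation_loss(student_logits, teacher_probs, hard_label, alpha=0.5):
    """Blend the soft-target loss with the ordinary hard-label loss.

    alpha is a hypothetical mixing weight; 0.5 is an arbitrary choice.
    """
    student_probs = softmax(student_logits)
    one_hot = [1.0 if i == hard_label else 0.0 for i in range(len(student_probs))]
    soft = cross_entropy(teacher_probs, student_probs)  # match the teacher
    hard = cross_entropy(one_hot, student_probs)        # match the label
    return alpha * soft + (1 - alpha) * hard

# Class order: cat, tiger, dog, plane.
teacher_probs = [0.782, 0.14, 0.07, 0.008]
CAT = 0
one_hot_cat = [1.0, 0.0, 0.0, 0.0]

# Two made-up students with the same logit for "cat".
student_a = [4.0, 2.3, 1.6, -0.6]   # runner-ups ordered tiger > dog > plane
student_b = [4.0, -0.6, 1.6, 2.3]   # runner-ups ordered plane > dog > tiger

# The one-hot label only looks at the "cat" probability, which is identical.
hard_a = cross_entropy(one_hot_cat, softmax(student_a))
hard_b = cross_entropy(one_hot_cat, softmax(student_b))

# The teacher's distribution also scores how the remaining mass is spread.
soft_a = cross_entropy(teacher_probs, softmax(student_a))
soft_b = cross_entropy(teacher_probs, softmax(student_b))

loss_a = distillation_loss(student_a, teacher_probs, CAT)
loss_b = distillation_loss(student_b, teacher_probs, CAT)

print(f"hard-label loss:  A={hard_a:.3f}  B={hard_b:.3f}")
print(f"soft-target loss: A={soft_a:.3f}  B={soft_b:.3f}")
print(f"combined loss:    A={loss_a:.3f}  B={loss_b:.3f}")
```

The hard-label losses come out equal while the soft-target loss is clearly lower for student A: the teacher's distribution penalises a student that thinks cats resemble planes more than tigers, which is exactly the signal a one-hot label throws away.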

