<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Transfer Learning on Chen Kai Blog</title><link>https://www.chenk.top/en/transfer-learning/</link><description>Recent content in Transfer Learning on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Sun, 06 Jul 2025 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/transfer-learning/index.xml" rel="self" type="application/rss+xml"/><item><title>Transfer Learning (12): Industrial Applications and Best Practices</title><link>https://www.chenk.top/en/transfer-learning/12-industrial-applications-and-best-practices/</link><pubDate>Sun, 06 Jul 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/12-industrial-applications-and-best-practices/</guid><description>&lt;p>This is the final part of the series. The previous eleven parts gave you the mechanics &amp;ndash; pretraining, fine-tuning, domain adaptation, few-shot and zero-shot learning, distillation, multi-task learning, multimodality, parameter-efficient methods, continual learning, and cross-lingual transfer. This part is about the work that happens once the notebook closes: deciding &lt;strong>whether&lt;/strong> to use transfer learning, &lt;strong>how&lt;/strong> to thread it into a production pipeline, and &lt;strong>how&lt;/strong> to know it is still working six months later.&lt;/p></description></item><item><title>Transfer Learning (11): Cross-Lingual Transfer</title><link>https://www.chenk.top/en/transfer-learning/11-cross-lingual-transfer/</link><pubDate>Mon, 30 Jun 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/11-cross-lingual-transfer/</guid><description>&lt;p>English has the labels. The world has 7,000+ languages. 
Cross-lingual transfer is what lets a sentiment classifier trained only on English IMDB reviews score Spanish tweets, what makes a question-answering model fine-tuned on SQuAD answer Hindi questions, and what allows a model that has never seen a single labelled Swahili sentence to do passable Swahili NER.&lt;/p>
&lt;p>This post derives why that is even possible. We start from the bilingual-embedding alignment that motivated the field, walk through the multilingual pretraining recipe (mBERT, XLM-R) that made parallel data optional, and end with the practical playbook &amp;ndash; zero-shot vs translate-train vs translate-test, when to pick which, and where the wheels come off.&lt;/p></description></item><item><title>Transfer Learning (10): Continual Learning</title><link>https://www.chenk.top/en/transfer-learning/10-continual-learning/</link><pubDate>Tue, 24 Jun 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/10-continual-learning/</guid><description>&lt;p>You can teach yourself to play guitar this year and you will still remember how to ride a bike. A neural network cannot. Fine-tune a vision model on CIFAR-10 then on SVHN, evaluate it on CIFAR-10 again, and accuracy collapses to barely above chance. The phenomenon is called &lt;strong>catastrophic forgetting&lt;/strong>, and overcoming it is the central problem of &lt;strong>continual learning (CL)&lt;/strong>: a learner that absorbs a stream of tasks $\mathcal{T}_1, \mathcal{T}_2, \ldots$ without re-accessing past data and without losing what it already knew.&lt;/p></description></item><item><title>Transfer Learning (9): Parameter-Efficient Fine-Tuning</title><link>https://www.chenk.top/en/transfer-learning/09-parameter-efficient-fine-tuning/</link><pubDate>Wed, 18 Jun 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/09-parameter-efficient-fine-tuning/</guid><description>&lt;p>How do you fine-tune a 175B-parameter model on a single GPU? Update only 0.1% of the parameters. Parameter-Efficient Fine-Tuning (PEFT) makes this possible &amp;ndash; and on most benchmarks it matches full fine-tuning. This post derives the math behind LoRA, Adapter, Prefix-Tuning, Prompt-Tuning, BitFit and QLoRA, and gives you a single picture for choosing among them.&lt;/p>
&lt;h2 id="what-you-will-learn">What You Will Learn&lt;/h2>
&lt;ul>
&lt;li>Why the low-rank assumption holds for weight updates&lt;/li>
&lt;li>LoRA: derivation, initialization, scaling, and weight merging&lt;/li>
&lt;li>Adapter: bottleneck architecture and where to insert it&lt;/li>
&lt;li>Prefix-Tuning vs Prompt-Tuning vs P-Tuning v2&lt;/li>
&lt;li>QLoRA: how 4-bit quantisation gets a 65B model on one GPU&lt;/li>
&lt;li>Method comparison and a selection guide grounded in GLUE numbers&lt;/li>
&lt;/ul>
&lt;h2 id="prerequisites">Prerequisites&lt;/h2>
&lt;ul>
&lt;li>Transformer architecture (attention, FFN, residual + LayerNorm)&lt;/li>
&lt;li>Matrix decomposition basics (rank, SVD)&lt;/li>
&lt;li>Transfer learning fundamentals (Parts 1-6)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="the-full-fine-tuning-problem">The Full Fine-Tuning Problem&lt;/h2>
&lt;p>Full fine-tuning updates every parameter $\boldsymbol{\theta}$:&lt;/p></description></item><item><title>Transfer Learning (8): Multimodal Transfer</title><link>https://www.chenk.top/en/transfer-learning/08-multimodal-transfer/</link><pubDate>Thu, 12 Jun 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/08-multimodal-transfer/</guid><description>&lt;p>How can a model classify an image of a Burmese cat correctly without ever having seen a label &amp;ldquo;Burmese cat&amp;rdquo;? Traditional supervised learning needs millions of labeled examples per class. CLIP, released by OpenAI in 2021, sidesteps that constraint entirely: it learns to put images and natural-language descriptions into the same vector space, and then &amp;ldquo;classification&amp;rdquo; reduces to picking which sentence — out of any candidate sentences you write down — sits closest to the image.&lt;/p></description></item><item><title>Transfer Learning (7): Zero-Shot Learning</title><link>https://www.chenk.top/en/transfer-learning/07-zero-shot-learning/</link><pubDate>Fri, 06 Jun 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/07-zero-shot-learning/</guid><description>&lt;p>You have never seen a zebra. I tell you it looks like a horse painted with black and white stripes, and the next time one walks into the zoo you recognise it instantly. No labelled examples, no fine-tuning — only a &lt;em>semantic bridge&lt;/em> between what you know (horses, stripes) and what you don&amp;rsquo;t (this new species).&lt;/p>
&lt;p>&lt;strong>Zero-shot learning (ZSL)&lt;/strong> is the machine-learning version of that trick. Train on a set of &lt;em>seen&lt;/em> classes for which you have labelled images. At test time, classify into a &lt;em>disjoint&lt;/em> set of &lt;em>unseen&lt;/em> classes that you have &lt;em>never&lt;/em> shown the model — using only a description of what those classes are: a list of attributes, a word embedding of the class name, a sentence, or an image-text contrastive prompt. The model&amp;rsquo;s only handle on the unseen classes is the geometry it has learned in a shared visual–semantic space.&lt;/p></description></item><item><title>Transfer Learning (6): Multi-Task Learning</title><link>https://www.chenk.top/en/transfer-learning/06-multi-task-learning/</link><pubDate>Sat, 31 May 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/06-multi-task-learning/</guid><description>&lt;p>A self-driving car looking through a single camera needs to do three things at once: detect cars and pedestrians, segment lanes and free space, and estimate how far away each pixel is. You could train three separate networks. You would burn 3x the parameters, run 3x the forward passes at inference, and ignore the obvious fact that all three tasks need the same kind of low-level features (edges, surfaces, occlusion cues).&lt;/p></description></item><item><title>Transfer Learning (5): Knowledge Distillation</title><link>https://www.chenk.top/en/transfer-learning/05-knowledge-distillation/</link><pubDate>Sun, 25 May 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/05-knowledge-distillation/</guid><description>&lt;p>You have a 340M-parameter BERT model that hits 95% accuracy. The product team wants it on a phone that can barely fit 10M parameters. Training a 10M model from scratch lands at 85%. 
Knowledge distillation closes most of the gap: train the small model on the &lt;em>output distribution&lt;/em> of the large one, not just on the labels, and you can reach 92%.&lt;/p>
&lt;p>The key insight, due to Hinton, is that a teacher&amp;rsquo;s &amp;ldquo;wrong&amp;rdquo; predictions are not noise &amp;ndash; they are information. When the teacher classifies a cat image and assigns 0.14 to &amp;ldquo;tiger&amp;rdquo;, 0.07 to &amp;ldquo;dog&amp;rdquo;, and 0.008 to &amp;ldquo;plane&amp;rdquo;, it is telling you that cats look a lot like tigers, somewhat like dogs, and nothing like aeroplanes. That structure &amp;ndash; &lt;strong>dark knowledge&lt;/strong> &amp;ndash; is invisible in a one-hot label, and learning it is what lets the student punch above its weight.&lt;/p></description></item><item><title>Transfer Learning (4): Few-Shot Learning</title><link>https://www.chenk.top/en/transfer-learning/04-few-shot-learning/</link><pubDate>Mon, 19 May 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/04-few-shot-learning/</guid><description>&lt;p>Show a child one photograph of a pangolin and they will spot pangolins for life. Show a deep learning model one photograph and it will give you a uniformly random guess. Few-shot learning is the field that closes that gap: building classifiers that work with only one to ten labeled examples per class.&lt;/p>
&lt;p>The trick is not to memorize individual classes harder. It is to learn &lt;em>how to learn&lt;/em> from very few examples, then carry that ability over to brand-new classes at test time. This article covers the two families that dominate the field today: &lt;strong>metric learning&lt;/strong>, which learns a good distance function, and &lt;strong>meta-learning&lt;/strong>, which learns a good initialization.&lt;/p></description></item><item><title>Transfer Learning (3): Domain Adaptation</title><link>https://www.chenk.top/en/transfer-learning/03-domain-adaptation/</link><pubDate>Tue, 13 May 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/03-domain-adaptation/</guid><description>&lt;p>Your autonomous-driving stack works perfectly on sunny California freeways. Then it rains in Seattle. Top-1 accuracy drops from 95% to 70%. The model did not get worse — the &lt;em>data distribution shifted&lt;/em>, and your training set never told it what wet asphalt looks like at dusk.&lt;/p>
&lt;p>This is the everyday problem of &lt;strong>domain adaptation&lt;/strong>: you have abundant labelled data in one distribution (the &lt;em>source&lt;/em>) and unlabelled data in another (the &lt;em>target&lt;/em>), and you need the model to perform on the target. This article shows you how, from first-principles theory to a working DANN implementation.&lt;/p></description></item><item><title>Transfer Learning (2): Pre-training and Fine-tuning</title><link>https://www.chenk.top/en/transfer-learning/02-pre-training-and-fine-tuning/</link><pubDate>Wed, 07 May 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/02-pre-training-and-fine-tuning/</guid><description>&lt;p>BERT changed NLP overnight. A model pre-trained on Wikipedia and BookCorpus could be fine-tuned on a few thousand labelled examples and beat task-specific architectures that researchers had spent years hand-crafting. The same pattern repeated in vision (ImageNet pre-training, then SimCLR, MAE), in speech (wav2vec 2.0), and in code (Codex). Today, &amp;ldquo;pre-train once, fine-tune everywhere&amp;rdquo; is the default recipe of modern deep learning.&lt;/p>
&lt;p>But &lt;em>why&lt;/em> does pre-training work? When should you freeze layers, when should you use LoRA, and how small does your learning rate need to be? This article unpacks both the theory and the engineering practice behind the most successful transfer paradigm we have.&lt;/p></description></item><item><title>Transfer Learning (1): Fundamentals and Core Concepts</title><link>https://www.chenk.top/en/transfer-learning/01-fundamentals-and-core-concepts/</link><pubDate>Thu, 01 May 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/01-fundamentals-and-core-concepts/</guid><description>&lt;p>You spent two weeks training an ImageNet classifier on a rack of GPUs. On Monday morning your team lead asks for a chest-X-ray pneumonia model &amp;ndash; and the entire labelled dataset is &lt;strong>two hundred images&lt;/strong>. Do you book another two weeks of GPU time and start from scratch?&lt;/p>
&lt;p>Of course not. You take what the ImageNet model already knows about edges, textures and shapes, swap out the last layer, and fine-tune on the X-rays. Two hours later you have a model that beats anything you could have trained from random weights with so little data. That is &lt;strong>transfer learning&lt;/strong>, and it is the reason most real-world deep-learning projects ship in days instead of months.&lt;/p></description></item></channel></rss>