Transfer Learning (8): Multimodal Transfer

Thu, 12 Jun 2025 09:00:00 +0000

How can a model classify an image of a Burmese cat correctly without ever having seen the label “Burmese cat”? Traditional supervised learning needs a large set of labeled examples for every class it must recognize, and it cannot predict a class that was absent from training. CLIP, released by OpenAI in 2021, sidesteps that constraint: it learns to map images and natural-language descriptions into the same vector space, so “classification” reduces to picking which sentence, out of whatever candidate sentences you write down, sits closest to the image.
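To make the "closest sentence wins" idea concrete, here is a minimal sketch of the decision rule in plain Python. It assumes the image and the candidate sentences have already been embedded; the 3-dimensional vectors, the captions, and the helper names `cosine_similarity` and `zero_shot_classify` are made up for illustration and are not CLIP's actual encoders or outputs. Cosine similarity is used as the closeness measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, text_embeddings):
    """Return the caption whose embedding is closest to the image, plus all scores."""
    scores = {
        caption: cosine_similarity(image_embedding, vector)
        for caption, vector in text_embeddings.items()
    }
    best_caption = max(scores, key=scores.get)
    return best_caption, scores


# Hypothetical embeddings: in a real system these would come from the
# text encoder and image encoder, not be written by hand.
text_embeddings = {
    "a photo of a Burmese cat": [0.9, 0.1, 0.2],
    "a photo of a golden retriever": [0.1, 0.9, 0.3],
    "a photo of a pickup truck": [0.2, 0.2, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

best_caption, scores = zero_shot_classify(image_embedding, text_embeddings)
print(best_caption)
for caption, score in scores.items():
    print(f"{score:.3f}  {caption}")
```

Note that the set of classes is just the keys of `text_embeddings`: adding a new class means adding a new sentence, with no retraining in this sketch.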

Multimodal Learning on Chen Kai Blog
