CLIP on Chen Kai Blog

NLP (11): Multimodal Large Language Models

Thu, 20 Nov 2025 09:00:00 +0000

Humans never perceive the world in one channel at a time. We watch a chart while reading the caption, hear a tone of voice while reading a face, glance at a screenshot while debating a bug. Pure-text language models are deaf and blind to all of that. Multimodal Large Language Models (MLLMs) close the gap by aligning images, audio, and video into the same representation space the language model already speaks.

Transfer Learning (8): Multimodal Transfer

Thu, 12 Jun 2025 09:00:00 +0000

How can a model classify an image of a Burmese cat correctly without ever having seen the label “Burmese cat”? Traditional supervised learning needs labelled examples for every class it will predict, often hundreds or thousands per class. CLIP, released by OpenAI in 2021, sidesteps that constraint: it learns to place images and natural-language descriptions in the same vector space, so “classification” reduces to picking which sentence, out of any candidate sentences you write down, sits closest to the image.
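A minimal sketch of that "pick the closest sentence" step, assuming the image and the candidate sentences have already been embedded. The 3-dimensional vectors and the helper names `cosine` and `zero_shot_classify` are hypothetical stand-ins for illustration, not CLIP's actual encoders or API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_vec, prompt_vecs):
    """Return the prompt whose embedding sits closest to the image embedding."""
    return max(prompt_vecs, key=lambda p: cosine(image_vec, prompt_vecs[p]))

# Hypothetical toy embeddings; a real system would get these from
# a trained text encoder and a trained image encoder.
prompt_vecs = {
    "a photo of a Burmese cat": [0.9, 0.1, 0.2],
    "a photo of a Siamese cat": [0.6, 0.7, 0.1],
    "a photo of a dog": [0.1, 0.2, 0.9],
}
image_vec = [0.8, 0.2, 0.3]

prediction = zero_shot_classify(image_vec, prompt_vecs)
print(prediction)
```

Adding a new class here means adding one more sentence to `prompt_vecs`; nothing is retrained, which is the point of the shared vector space.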

Transfer Learning (7): Zero-Shot Learning

Fri, 06 Jun 2025 09:00:00 +0000

You have never seen a zebra. I tell you it looks like a horse painted with black and white stripes, and the next time one walks into the zoo you recognise it instantly. No labelled examples, no fine-tuning — only a semantic bridge between what you know (horses, stripes) and what you don’t (this new species).

Zero-shot learning (ZSL) is the machine-learning version of that trick. Train on a set of seen classes for which you have labelled images. At test time, classify into a disjoint set of unseen classes for which the model has seen no labelled images, using only a description of each class: a list of attributes, a word embedding of the class name, a sentence, or a text prompt for an image-text contrastive model. The model’s only handle on the unseen classes is the geometry it has learned in a shared visual–semantic space.
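The attribute-list variant can be sketched in a few lines. This is an illustration under stated assumptions: the attribute names, the class table, and the `predicted` vector are all hypothetical, and `predicted` stands in for the output of an attribute predictor that would have been trained on seen classes only.

```python
# Attribute order (hypothetical): [horse_shaped, striped, has_mane, spotted]
unseen_class_attributes = {
    "zebra":   [1.0, 1.0, 1.0, 0.0],
    "leopard": [0.0, 0.0, 0.0, 1.0],
}

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def classify_unseen(predicted_attributes, class_attributes):
    """Pick the unseen class whose attribute description is nearest."""
    return min(
        class_attributes,
        key=lambda c: squared_distance(predicted_attributes, class_attributes[c]),
    )

# Stand-in for attribute scores predicted from a test image.
predicted = [0.9, 0.8, 0.7, 0.1]

label = classify_unseen(predicted, unseen_class_attributes)
print(label)
```

No zebra image is needed at training time: the predictor learns "horse-shaped" and "striped" from seen classes, and the class description supplies the rest.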