<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>CLIP on Chen Kai Blog</title><link>https://www.chenk.top/en/tags/clip/</link><description>Recent content in CLIP on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Thu, 20 Nov 2025 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/tags/clip/index.xml" rel="self" type="application/rss+xml"/><item><title>NLP (11): Multimodal Large Language Models</title><link>https://www.chenk.top/en/nlp/multimodal-nlp/</link><pubDate>Thu, 20 Nov 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/multimodal-nlp/</guid><description>&lt;p>Humans never perceive the world in one channel at a time. We watch a chart while reading the caption, hear a tone of voice while reading a face, glance at a screenshot while debating a bug. Pure-text language models are deaf and blind to all of that. &lt;strong>Multimodal Large Language Models (MLLMs)&lt;/strong> close the gap by aligning images, audio, and video into the same representation space the language model already speaks.&lt;/p></description></item><item><title>Transfer Learning (8): Multimodal Transfer</title><link>https://www.chenk.top/en/transfer-learning/08-multimodal-transfer/</link><pubDate>Thu, 12 Jun 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/08-multimodal-transfer/</guid><description>&lt;p>How can a model classify an image of a Burmese cat correctly without ever having seen a label &amp;ldquo;Burmese cat&amp;rdquo;? Traditional supervised learning needs a large set of labeled examples for every class it must recognize. 
CLIP, released by OpenAI in 2021, sidesteps that constraint entirely: it learns to put images and natural-language descriptions into the same vector space, and then &amp;ldquo;classification&amp;rdquo; reduces to picking which sentence — out of any candidate sentences you write down — sits closest to the image.&lt;/p></description></item><item><title>Transfer Learning (7): Zero-Shot Learning</title><link>https://www.chenk.top/en/transfer-learning/07-zero-shot-learning/</link><pubDate>Fri, 06 Jun 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/07-zero-shot-learning/</guid><description>&lt;p>You have never seen a zebra. I tell you it looks like a horse painted with black and white stripes, and the next time one walks into the zoo you recognise it instantly. No labelled examples, no fine-tuning — only a &lt;em>semantic bridge&lt;/em> between what you know (horses, stripes) and what you don&amp;rsquo;t (this new species).&lt;/p>
&lt;p>&lt;strong>Zero-shot learning (ZSL)&lt;/strong> is the machine-learning version of that trick. Train on a set of &lt;em>seen&lt;/em> classes for which you have labelled images. At test time, classify into a &lt;em>disjoint&lt;/em> set of &lt;em>unseen&lt;/em> classes that you have &lt;em>never&lt;/em> shown the model — using only a description of what those classes are: a list of attributes, a word embedding of the class name, a sentence, or an image-text contrastive prompt. The model&amp;rsquo;s only handle on the unseen classes is the geometry it has learned in a shared visual–semantic space.&lt;/p></description></item></channel></rss>