<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Multimodal Learning on Chen Kai Blog</title><link>https://www.chenk.top/en/tags/multimodal-learning/</link><description>Recent content in Multimodal Learning on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Thu, 12 Jun 2025 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/tags/multimodal-learning/index.xml" rel="self" type="application/rss+xml"/><item><title>Transfer Learning (8): Multimodal Transfer</title><link>https://www.chenk.top/en/transfer-learning/08-multimodal-transfer/</link><pubDate>Thu, 12 Jun 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/transfer-learning/08-multimodal-transfer/</guid><description>&lt;p>How can a model correctly classify an image of a Burmese cat without ever having seen the label &amp;ldquo;Burmese cat&amp;rdquo;? Traditional supervised learning needs many labeled examples for every class it must recognize. CLIP, released by OpenAI in 2021, sidesteps that constraint: it learns to place images and natural-language descriptions in the same vector space, so that &amp;ldquo;classification&amp;rdquo; reduces to picking which sentence, out of any candidate sentences you write down, sits closest to the image.&lt;/p></description></item></channel></rss>