<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Multimodal on Chen Kai Blog</title><link>https://www.chenk.top/en/tags/multimodal/</link><description>Recent content in Multimodal on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Fri, 27 Feb 2026 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/tags/multimodal/index.xml" rel="self" type="application/rss+xml"/><item><title>Aliyun Bailian (3): Qwen-Omni for Video, Audio, and Image Understanding</title><link>https://www.chenk.top/en/aliyun-bailian/03-qwen-omni-multimodal/</link><pubDate>Fri, 27 Feb 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/aliyun-bailian/03-qwen-omni-multimodal/</guid><description>&lt;p>Of all the Bailian models, Qwen-Omni is the one that has most often rescued me from product-roadmap trouble. &amp;ldquo;Can you tell me what&amp;rsquo;s happening in this 2-minute promo video?&amp;rdquo; used to be a 3-week project: extract frames, caption each one, and stitch the captions together. With Qwen-Omni, it&amp;rsquo;s a single HTTP request. The documentation, however, glosses over a few pitfalls, such as the streaming requirement, which has cost more than one team half a day. Let&amp;rsquo;s spare you that.&lt;/p></description></item><item><title>NLP (11): Multimodal Large Language Models</title><link>https://www.chenk.top/en/nlp/multimodal-nlp/</link><pubDate>Thu, 20 Nov 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/multimodal-nlp/</guid><description>&lt;p>Humans never perceive the world in one channel at a time. We watch a chart while reading the caption, hear a tone of voice while reading a face, glance at a screenshot while debating a bug. Pure-text language models are deaf and blind to all of that. 
&lt;strong>Multimodal Large Language Models (MLLMs)&lt;/strong> close the gap by aligning images, audio, and video into the same representation space the language model already speaks.&lt;/p></description></item><item><title>Multimodal LLMs and Downstream Tasks: A Practitioner's Guide</title><link>https://www.chenk.top/en/standalone/multimodal-llm-downstream-tasks/</link><pubDate>Sat, 09 Apr 2022 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/standalone/multimodal-llm-downstream-tasks/</guid><description>&lt;p>Stuffing pixels, audio, and video into a language model so it can &amp;ldquo;see,&amp;rdquo; &amp;ldquo;hear,&amp;rdquo; and reason — that was a research curiosity before CLIP landed in 2021. Today it&amp;rsquo;s table stakes for most consumer-facing AI products. But shipping a Multimodal LLM (MLLM) in production turns out to be hard in places people rarely talk about. Almost never the vision encoder. Almost always these four:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Alignment.&lt;/strong> How does the language model &amp;ldquo;understand&amp;rdquo; what the vision encoder produces? Is the projector a 2-layer MLP or a Q-Former? Which parameters thaw during training?&lt;/li>
&lt;li>&lt;strong>Task framing.&lt;/strong> The same MLLM has to do captioning, VQA, grounding, and OCR. Each task needs its own prompt template; a poorly matched one quietly costs several points of accuracy.&lt;/li>
&lt;li>&lt;strong>Cost.&lt;/strong> A 1024x1024 image becomes hundreds of visual tokens. Prefill is brutal. Stretch that to video and the bill goes vertical. Token compression, KV cache reuse, and batching are not optional.&lt;/li>
&lt;li>&lt;strong>Evaluation.&lt;/strong> A model that scores 80 on MMBench can still hallucinate confidently on your customer&amp;rsquo;s invoice. Public benchmarks are the easy part.&lt;/li>
&lt;/ol>
&lt;p>This post follows the natural research arc — architecture, model families, downstream tasks, fine-tuning, evaluation, deployment — and tries to be specific enough at each stop that you can act on it. Less &amp;ldquo;what&amp;rsquo;s possible,&amp;rdquo; more &amp;ldquo;what to actually pick.&amp;rdquo;&lt;/p></description></item></channel></rss>