Multimodal on Chen Kai Blog

Aliyun Bailian (3): Qwen-Omni for Video, Audio, and Image Understanding

Fri, 27 Feb 2026 09:00:00 +0000

Of all the Bailian models, Qwen-Omni has saved me from the most product-roadmap headaches. A request like “Can you tell me what’s happening in this 2-minute promo video?” used to take 3 weeks of work: extracting frames, captioning each one, and stitching the captions together. With Qwen-Omni, it’s just one HTTP request. However, the documentation glosses over some pitfalls, such as the streaming requirement, which has cost more than one team half a day. Let’s spare you that.

NLP (11): Multimodal Large Language Models

Thu, 20 Nov 2025 09:00:00 +0000

Humans never perceive the world in one channel at a time. We watch a chart while reading the caption, hear a tone of voice while reading a face, glance at a screenshot while debating a bug. Pure-text language models are deaf and blind to all of that. Multimodal Large Language Models (MLLMs) close the gap by aligning images, audio, and video into the same representation space the language model already speaks.

Multimodal LLMs and Downstream Tasks: A Practitioner's Guide

Sat, 09 Apr 2022 09:00:00 +0000

Stuffing pixels, audio, and video into a language model so it can “see,” “hear,” and reason — that was a research curiosity before CLIP landed in 2021. Today it’s table stakes for most consumer-facing AI products. But shipping a Multimodal LLM (MLLM) in production turns out to be hard in places people rarely talk about. Almost never the vision encoder. Almost always these four:

Alignment. How does the language model “understand” what the vision encoder produces? Is the projector a 2-layer MLP or a Q-Former? Which parameters thaw during training?
Task framing. The same MLLM has to do captioning, VQA, grounding, OCR. Each needs a prompt template that doesn’t quietly drop several points of accuracy.
Cost. A 1024x1024 image becomes hundreds to over a thousand visual tokens, depending on the encoder’s patch size and how the input is resized. Prefill is brutal. Stretch that to video and the bill goes vertical. Token compression, KV cache reuse, and batching are not optional.
Evaluation. A model that scores 80 on MMBench can still hallucinate confidently on your customer’s invoice. Public benchmarks are the easy part.
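
To make the cost point concrete, here is a rough back-of-the-envelope sketch of how visual token counts grow. It assumes a ViT-style encoder with a 14-pixel patch and an optional 2x2 patch merge; both numbers are illustrative assumptions, not the settings of any specific model, and the function names are hypothetical.

```python
import math

def visual_tokens(width, height, patch=14, merge=2):
    """Estimate visual tokens for one image.

    Assumptions (illustrative only): the encoder cuts the image into
    patch x patch pixel squares, then merges merge x merge neighbouring
    patches into a single token. Use merge=1 for no merging.
    """
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    return math.ceil(cols / merge) * math.ceil(rows / merge)

def video_tokens(width, height, seconds, fps=1.0, patch=14, merge=2):
    """Estimate visual tokens for a clip sampled at a fixed frame rate."""
    frames = math.ceil(seconds * fps)
    return frames * visual_tokens(width, height, patch=patch, merge=merge)

if __name__ == "__main__":
    # One full-resolution image versus a downsized one.
    print(visual_tokens(1024, 1024))          # 2x2 merge
    print(visual_tokens(336, 336, merge=1))   # no merge, small input
    # A 2-minute clip sampled at 1 frame per second.
    print(video_tokens(336, 336, seconds=120, fps=1.0, merge=1))
```

Under these assumptions, resizing the input and merging patches change the count by an order of magnitude, and video multiplies whatever is left by the number of sampled frames. That is why the frame rate and resolution caps are usually the first knobs to tune.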

This post follows the natural research arc — architecture, model families, downstream tasks, fine-tuning, evaluation, deployment — and tries to be specific enough at each stop that you can act on it. Less “what’s possible,” more “what to actually pick.”