Aliyun Bailian (3): Qwen-Omni for Video, Audio, and Image Understanding
Qwen-Omni for production multimodal: the four input types, the streaming requirement that the docs do not warn you about, and a working video-understanding pipeline with sane pixel budgets.
Of all the Bailian models, Qwen-Omni is the one that has pulled me out of the most product-roadmap holes. “Can you tell me what’s happening in this 2-minute promo video?” used to be a 3-week project involving frame extraction, captioning per frame, and a stitch step. With Qwen-Omni it is one HTTP request. But the docs are sparse on the gotchas, and there is one (streaming is mandatory) that has cost more than one team a half-day. Let’s not have that be you.
What Qwen-Omni accepts
Per the Qwen API reference for multimodal models, a single user message can mix text, image, audio, and video parts in one content array. That is the headline capability: not "supports images", but "supports anything in any combination".

The structure for each type, drawn from the API reference:
| Type | Field | Notes |
|---|---|---|
| `text` | `text: "..."` | Plain string. |
| `image_url` | `image_url: {url}` | URL or base64 data URI. `min_pixels` / `max_pixels` control resizing. |
| `input_audio` | `data`, `format` | mp3, wav, etc. URL, or base64 for local files. |
| `video_url` | `video_url: {url}` | URL or data URI. Or use a `video` array of frame images. |
A real call:
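A minimal sketch of such a call in Python, assuming the OpenAI-compatible DashScope endpoint and the `qwen-omni-turbo` model name (verify both against the current Bailian docs; the example URLs are placeholders):

```python
import os


def build_omni_messages(prompt: str, image_url: str, video_url: str) -> list:
    """One user message whose content array mixes video, image, and text parts."""
    return [{
        "role": "user",
        "content": [
            {"type": "video_url", "video_url": {"url": video_url}},
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": prompt},
        ],
    }]


def describe_video(prompt: str, image_url: str, video_url: str) -> str:
    from openai import OpenAI  # pip install openai

    client = OpenAI(
        api_key=os.environ["DASHSCOPE_API_KEY"],
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    )
    stream = client.chat.completions.create(
        model="qwen-omni-turbo",  # assumed model name; check the model list
        messages=build_omni_messages(prompt, image_url, video_url),
        modalities=["text"],      # text-only output
        stream=True,              # mandatory for Qwen-Omni
    )
    # Join the streamed deltas into one string.
    return "".join(
        chunk.choices[0].delta.content
        for chunk in stream
        if chunk.choices and chunk.choices[0].delta.content
    )
```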
Streaming is not optional — this is the trap
The docs note streaming as a feature, but they bury the fact that for Qwen-Omni it is required. Set `stream=False` and you get a 400 with a message about the model requiring streaming.

The reason makes sense once you think about it: the model is processing many MB of video and producing a long response. The wire protocol assumes incremental delivery. Holding the whole thing back to send as one blob would block your client for tens of seconds with no progress signal.
If your downstream code expects a complete string, buffer the deltas yourself:
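A sketch of that buffering helper: it walks the stream and collapses the deltas back into one string, skipping chunks that carry no content (such as a final usage chunk).

```python
def collect_stream(stream) -> str:
    """Join streamed chat-completion deltas into one complete string."""
    parts = []
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip them.
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts)
```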
It is one extra function. You will write it once.
Pixel and frame budgets — what costs you
The cost knob the docs are quiet about: `min_pixels` and `max_pixels` for images, and the equivalent `fps` / resize parameters for video. By default Qwen-Omni will process video at native resolution and a default fps. For a 2-minute 1080p clip that is a lot of token-equivalents, and the bill scales with it.
What I do in production:
- Images for understanding tasks: set `max_pixels: 1280*720`. Almost no quality loss for "what's in this image" tasks, big cost savings. Set `min_pixels: 640*480` so the model never scales up tiny crops.
- Video for description tasks: pre-resize to 720p before upload, and downsample to 4 fps for static-ish content (people talking) or 8 fps for action content (sports, fast cuts). Above 8 fps you are usually paying for redundant frames.
- Long video: chunk it. The model has a context limit. For anything over ~3 minutes, split into 90-second chunks, summarize each, then summarize-the-summaries with `qwen-plus`. Same pattern as long-document RAG.
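A hypothetical helper pinning the image budget above. The exact placement of `min_pixels` / `max_pixels` alongside the content part follows my reading of the API reference; double-check the current schema before shipping.

```python
def budgeted_image_part(url: str) -> dict:
    """An image content part with an explicit pixel budget (placement assumed)."""
    return {
        "type": "image_url",
        "image_url": {"url": url},
        "max_pixels": 1280 * 720,  # cap: near-lossless for understanding tasks
        "min_pixels": 640 * 480,   # floor: never scale up tiny crops
    }
```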
Sending a local video file
You have two options. The docs cover both.

Path 1 (preferred): upload to OSS, send signed URL.
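A sketch of Path 1 with the `oss2` SDK. The bucket name, endpoint, and object key are placeholders; the `sign_url` call producing a time-limited GET URL is the part that matters.

```python
def upload_and_sign(bucket, local_path: str, key: str, ttl_seconds: int = 3600) -> str:
    """Upload a local video to OSS and return a time-limited signed GET URL."""
    bucket.put_object_from_file(key, local_path)
    return bucket.sign_url("GET", key, ttl_seconds)


def make_bucket():
    # Placeholder bucket/endpoint; substitute your own.
    import os
    import oss2  # pip install oss2

    auth = oss2.Auth(
        os.environ["OSS_ACCESS_KEY_ID"],
        os.environ["OSS_ACCESS_KEY_SECRET"],
    )
    return oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "my-video-bucket")
```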
Then pass the signed URL as the `url` field. This is the right answer for anything longer than 30 seconds, because base64 inflates the payload by about 33%.
Path 2: base64 inline. Useful for short clips, avoids the round trip to OSS.
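A sketch of Path 2: inline the clip as a base64 data URI and use it where a URL would go.

```python
import base64
from pathlib import Path


def video_data_uri(path: str, mime: str = "video/mp4") -> str:
    """Encode a local video file as a base64 data URI."""
    b64 = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{b64}"


# Usage: {"type": "video_url", "video_url": {"url": video_data_uri("clip.mp4")}}
```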
Real-world tip: When debugging a Qwen-Omni 400, check the URL is publicly fetchable from the open internet. The model service does not have access to your VPC. Signed URLs work; private OSS without signing does not.
Audio understanding
Almost the same shape, but with `type: "input_audio"`:
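A sketch of building that part from a local file, assuming the `data` field accepts a base64 data URI (per the table above, a plain URL also works):

```python
import base64
from pathlib import Path


def audio_part(path: str, fmt: str = "mp3") -> dict:
    """An input_audio content part from a local file (data-URI form assumed)."""
    b64 = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return {
        "type": "input_audio",
        "input_audio": {
            "data": f"data:audio/{fmt};base64,{b64}",
            "format": fmt,
        },
    }
```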
For pure transcription, Bailian also exposes a dedicated Paraformer ASR model that is cheaper. Use Paraformer for transcribe-only and Qwen-Omni when you need understanding (sentiment, summarization, “did the call mention pricing”).
A real product use case
The recurring pattern I ship for AI marketing: a creative team uploads a 60-second product video; we want a structured caption (scene description, key product features visible, target audience guess, recommended music style). One Qwen-Omni call, JSON mode (yes it works with multimodal), under 4 seconds wall clock for 720p input.
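A sketch of that structured-caption call. The key names in the prompt are illustrative, and the `response_format={"type": "json_object"}` flag is the OpenAI-compatible JSON-mode switch; verify that your model snapshot supports it.

```python
import json
import os

CAPTION_PROMPT = (
    "Watch the video and return JSON with exactly these keys: "
    '"scene_description", "visible_product_features", '
    '"target_audience_guess", "recommended_music_style".'
)


def caption_video(video_url: str) -> dict:
    from openai import OpenAI  # pip install openai

    client = OpenAI(
        api_key=os.environ["DASHSCOPE_API_KEY"],
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    )
    stream = client.chat.completions.create(
        model="qwen-omni-turbo",  # assumed model name
        messages=[{"role": "user", "content": [
            {"type": "video_url", "video_url": {"url": video_url}},
            {"type": "text", "text": CAPTION_PROMPT},
        ]}],
        response_format={"type": "json_object"},
        stream=True,  # still mandatory, even in JSON mode
    )
    text = "".join(
        c.choices[0].delta.content
        for c in stream
        if c.choices and c.choices[0].delta.content
    )
    return json.loads(text)
```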
What’s next
Article 4 jumps to the production side — Wanxiang text-to-video. That is async-only, native-protocol-only, and the failure modes are completely different (queue depth, output URL expiry). It is also the API I have spent the most time tuning prompts for.