Tagged: Multimodal

Feb 27, 2026 Aliyun Bailian 5 min read

Aliyun Bailian (3): Qwen-Omni for Video, Audio, and Image Understanding

Qwen-Omni for production multimodal: the four input types, the streaming requirement that the docs do not warn you about, and a working video-understanding pipeline with sane pixel budgets.

Nov 20, 2025 NLP 17 min read

NLP (11): Multimodal Large Language Models

A deep dive into multimodal LLMs: contrastive vision-language pre-training with CLIP, parameter-efficient bridging with BLIP-2's Q-Former, visual instruction tuning with LLaVA, robust speech recognition with Whisper, …

Aug 5, 2022 Standalone 21 min read

Multimodal LLMs and Downstream Tasks: A Practitioner's Guide

End-to-end map of multimodal LLMs: vision-language alignment, cross-modal fusion, the CLIP/BLIP/LLaVA families, downstream tasks (VQA, captioning, grounding, OCR), fine-tuning trade-offs, benchmarks, and what it takes to …