Tagged: Multimodal
Aliyun Bailian (3): Qwen-Omni for Video, Audio, and Image Understanding
Qwen-Omni for production multimodal workloads: the four input types, the streaming requirement the docs don't warn you about, and a working video-understanding pipeline with sane pixel budgets.
NLP (11): Multimodal Large Language Models
A deep dive into multimodal LLMs: contrastive vision-language pre-training with CLIP, parameter-efficient bridging with BLIP-2's Q-Former, visual instruction tuning with LLaVA, robust speech recognition with Whisper, …
Multimodal LLMs and Downstream Tasks: A Practitioner's Guide
End-to-end map of multimodal LLMs: vision-language alignment, cross-modal fusion, the CLIP/BLIP/LLaVA families, downstream tasks (VQA, captioning, grounding, OCR), fine-tuning trade-offs, benchmarks, and what it takes to …