NLP (11): Multimodal Large Language Models

Thu, 20 Nov 2025 09:00:00 +0000

Humans never perceive the world in one channel at a time. We watch a chart while reading the caption, hear a tone of voice while reading a face, glance at a screenshot while debating a bug. Pure-text language models are deaf and blind to all of that. Multimodal Large Language Models (MLLMs) close the gap by mapping images, audio, and video into the same representation space the language model already works in.

