<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>BLIP-2 on Chen Kai Blog</title><link>https://www.chenk.top/en/tags/blip-2/</link><description>Recent content in BLIP-2 on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Thu, 20 Nov 2025 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/tags/blip-2/index.xml" rel="self" type="application/rss+xml"/><item><title>NLP (11): Multimodal Large Language Models</title><link>https://www.chenk.top/en/nlp/multimodal-nlp/</link><pubDate>Thu, 20 Nov 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/multimodal-nlp/</guid><description>&lt;p>Humans never perceive the world in one channel at a time. We watch a chart while reading the caption, hear a tone of voice while reading a face, glance at a screenshot while debating a bug. Pure-text language models are deaf and blind to all of that. &lt;strong>Multimodal Large Language Models (MLLMs)&lt;/strong> close the gap by mapping images, audio, and video into the same representation space the language model already speaks.&lt;/p></description></item></channel></rss>