CLIP

Nov 20, 2025 NLP 32 min read

NLP (11): Multimodal Large Language Models

A deep dive into multimodal LLMs: contrastive vision-language pre-training with CLIP, parameter-efficient bridging with BLIP-2's Q-Former, visual instruction tuning with LLaVA, robust speech recognition with Whisper, …

Jun 12, 2025 Transfer Learning 50 min read

Transfer Learning (8): Multimodal Transfer

Derive contrastive learning (InfoNCE), CLIP's vision-language pretraining, BLIP's Q-Former bridge to LLMs, cross-modal alignment, and multimodal fusion strategies. Includes a from-scratch CLIP implementation in PyTorch.

Jun 6, 2025 Transfer Learning 32 min read

Transfer Learning (7): Zero-Shot Learning

A first-principles tour of zero-shot learning: attribute prototypes (DAP), compatibility functions, DeViSE, generative ZSL with f-CLSWGAN, the GZSL bias problem and calibration, and CLIP-style vision-language …