Tagged
CLIP
NLP (11): Multimodal Large Language Models
A deep dive into multimodal LLMs: contrastive vision-language pre-training with CLIP, parameter-efficient bridging with BLIP-2's Q-Former, visual instruction tuning with LLaVA, robust speech recognition with Whisper, …
Transfer Learning (8): Multimodal Transfer
Derive contrastive learning (InfoNCE), CLIP's vision-language pretraining, BLIP's Q-Former bridge to LLMs, cross-modal alignment, and multimodal fusion strategies. Includes a from-scratch CLIP implementation in PyTorch.
Transfer Learning (7): Zero-Shot Learning
A first-principles tour of zero-shot learning: attribute prototypes (DAP), compatibility functions, DeViSE, generative ZSL with f-CLSWGAN, the GZSL bias problem and calibration, and CLIP-style vision-language …