FSDP

Mar 29, 2026 · LLM Engineering · 42 min read

LLM Engineering (3): Pretraining at Scale

Data mixing, deduplication, contamination, μP, FSDP vs. ZeRO-3 vs. pipeline parallelism, the practical 200B-token cliff, and the failure modes that only appear above 1,000 GPUs.