Tags
Data-Mixing
LLM Engineering (3): Pretraining at Scale
Data mixing, deduplication, contamination, μP, FSDP vs ZeRO-3 vs pipeline parallel, the practical 200B-token cliff, and the failure modes that only appear above 1000 GPUs.
Data mixing, deduplication, contamination, μP, FSDP vs ZeRO-3 vs pipeline parallel, the practical 200B-token cliff, and the failure modes that only appear above 1000 GPUs.