Description
Zero Redundancy Optimizer (ZeRO) is a sharded data parallel method for distributed training. ZeRO-DP removes the memory redundancies across data-parallel processes by partitioning the model states (optimizer states, gradients, and parameters) instead of replicating them, while preserving the compute/communication efficiency of data parallelism: it keeps the same computational granularity and communication volume as DP by using a dynamic communication schedule during training.
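The partitioning idea can be illustrated with a minimal sketch of ZeRO stage 1 (optimizer-state sharding). All names here are hypothetical and Python lists stand in for tensors; a real implementation (e.g. DeepSpeed) would use collectives such as reduce-scatter and all-gather across GPUs, which this single-process toy only emulates.

```python
import math

def shard_bounds(n, rank, world_size):
    """Contiguous slice of an n-element state owned by `rank`."""
    per_rank = (n + world_size - 1) // world_size
    return rank * per_rank, min(n, (rank + 1) * per_rank)

def zero1_step(params, grads, momentum, rank, world_size, lr=0.1, beta=0.9):
    """Each rank updates only its own shard of the parameters, using the
    optimizer state (here: momentum) it alone stores for that shard.
    Afterwards the updated shards would be all-gathered so every rank
    again holds the full parameters for the next forward pass."""
    lo, hi = shard_bounds(len(params), rank, world_size)
    new_shard = []
    for i in range(lo, hi):
        momentum[i - lo] = beta * momentum[i - lo] + grads[i]  # sharded state
        new_shard.append(params[i] - lr * momentum[i - lo])
    return lo, new_shard

# Toy run: 4 parameters, 2 data-parallel ranks, each owning 2 parameters.
params = [1.0, 2.0, 3.0, 4.0]
grads = [0.5, 0.5, 0.5, 0.5]
world_size = 2
full = list(params)
for rank in range(world_size):
    state = [0.0, 0.0]  # each rank stores only 1/world_size of the momentum
    lo, new_shard = zero1_step(params, grads, state, rank, world_size)
    full[lo:lo + len(new_shard)] = new_shard  # emulate the all-gather
print(full)  # each parameter moves by roughly lr * grad = 0.05
```

Because each rank stores only its 1/N slice of the optimizer state (and, in the later ZeRO stages, of the gradients and parameters as well), per-device memory for those states drops roughly by the data-parallel degree N.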
Papers Using This Method
Memory Analysis on the Training Course of DeepSeek Models (2025-02-11)
Accelerating Large Language Model Training with Hybrid GPU-based Compression (2024-09-04)
A Study of Optimizations for Fine-tuning Large Language Models (2024-06-04)
Zero redundancy distributed learning with differential privacy (2023-11-20)
Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models (2023-11-07)
ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models (2023-10-16)
Rethinking Memory and Communication Cost for Efficient Large Language Model Training (2023-10-09)
ZeRO++: Extremely Efficient Collective Communication for Giant Model Training (2023-06-16)
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (2019-10-04)