Description
Zero Redundancy Optimizer (ZeRO) is a sharded data parallel method for distributed training. ZeRO-DP removes the memory redundancies across data-parallel processes by partitioning the model states (optimizer states, gradients, and parameters) instead of replicating them, while preserving the compute/communication efficiency of data parallelism: it keeps the same computational granularity and communication volume as DP by using a dynamic communication schedule during training.
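The partitioning idea can be illustrated with a minimal sketch of ZeRO stage 1 (optimizer-state sharding). All names here are hypothetical and Python lists stand in for tensors; a real implementation (e.g. DeepSpeed) would use collectives such as reduce-scatter and all-gather across GPUs, which this single-process toy only emulates.

```python
import math

def shard_bounds(n, rank, world_size):
    """Contiguous slice of an n-element state owned by `rank`."""
    per_rank = (n + world_size - 1) // world_size
    return rank * per_rank, min(n, (rank + 1) * per_rank)

def zero1_step(params, grads, momentum, rank, world_size, lr=0.1, beta=0.9):
    """Each rank updates only its own shard of the parameters, using the
    optimizer state (here: momentum) it alone stores for that shard.
    Afterwards the updated shards would be all-gathered so every rank
    again holds the full parameters for the next forward pass."""
    lo, hi = shard_bounds(len(params), rank, world_size)
    new_shard = []
    for i in range(lo, hi):
        momentum[i - lo] = beta * momentum[i - lo] + grads[i]  # sharded state
        new_shard.append(params[i] - lr * momentum[i - lo])
    return lo, new_shard

# Toy run: 4 parameters, 2 data-parallel ranks, each owning 2 parameters.
params = [1.0, 2.0, 3.0, 4.0]
grads = [0.5, 0.5, 0.5, 0.5]
world_size = 2
full = list(params)
for rank in range(world_size):
    state = [0.0, 0.0]  # each rank stores only 1/world_size of the momentum
    lo, new_shard = zero1_step(params, grads, state, rank, world_size)
    full[lo:lo + len(new_shard)] = new_shard  # emulate the all-gather
print(full)  # each parameter moves by roughly lr * grad = 0.05
```

Because each rank stores only its 1/N slice of the optimizer state (and, in the later ZeRO stages, of the gradients and parameters as well), per-device memory for those states drops roughly by the data-parallel degree N.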
Papers Using This Method
Memory Analysis on the Training Course of DeepSeek Models (2025-02-11)
Accelerating Large Language Model Training with Hybrid GPU-based Compression (2024-09-04)
A Study of Optimizations for Fine-tuning Large Language Models (2024-06-04)
Zero redundancy distributed learning with differential privacy (2023-11-20)
Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models (2023-11-07)
ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models (2023-10-16)
Rethinking Memory and Communication Cost for Efficient Large Language Model Training (2023-10-09)
ZeRO++: Extremely Efficient Collective Communication for Giant Model Training (2023-06-16)
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (2019-10-04)