Microsoft researchers have developed a new system called ZeRO++ to optimize the training of large AI models. This system addresses the challenges of high data transfer overhead and limited bandwidth. ZeRO++ builds upon existing ZeRO optimizations and offers enhanced communication strategies to improve training efficiency and reduce training time and cost.
Training large models like Turing-NLG, ChatGPT, and GPT-4 requires substantial memory and computing resources across multiple GPU devices. ZeRO++ introduces communication optimization strategies to overcome the limitations of ZeRO in scenarios with a small batch size per GPU or when training on low-bandwidth clusters.
The ZeRO family of optimizations, including ZeRO-Inference, allows for the partitioning of model states across GPUs instead of replication, utilizing the collective GPU memory and compute power. However, ZeRO incurs high communication overheads during training. ZeRO++ addresses this by incorporating three sets of communication optimizations: quantized weight communication (qwZ), hierarchical weight partition (hpZ), and quantized gradient communication (qgZ).
To reduce parameter communication volume, qwZ quantizes weights before they are communicated, using block-based quantization that keeps an independent scale for each block of the tensor. ZeRO++ ships optimized kernels for this step, making it both faster and more accurate than basic quantization. To minimize communication overhead during backward propagation, hpZ trades GPU memory for communication by maintaining a full model copy within each machine, so the weights needed for the backward pass can be gathered over fast intra-node links instead of the slower cross-node network. For gradient communication, ZeRO++ introduces qgZ, a novel quantized gradient communication paradigm that reduces cross-node traffic and latency.
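To illustrate why block-based quantization can be more accurate than quantizing a whole tensor with a single scale, here is a minimal pure-Python sketch. It is not the ZeRO++ kernel: the function names, the block size of 64, and the synthetic weights with one large outlier are all assumptions chosen for illustration, and the scheme shown is plain symmetric round-to-nearest INT8.

```python
import random


def quantize_dequantize(values, bits=8):
    """Symmetric round-to-nearest quantization using ONE scale for `values`.

    Returns the dequantized values so the rounding error can be measured.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / qmax
    return [round(v / scale) * scale for v in values]


def blockwise_quantize_dequantize(values, block_size, bits=8):
    """Quantize each block of `block_size` values with its own scale."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(quantize_dequantize(values[start:start + block_size], bits))
    return out


def mean_abs_error(original, restored):
    return sum(abs(x - y) for x, y in zip(original, restored)) / len(original)


# Hypothetical weights: mostly small values plus a single large outlier.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(1024)]
weights[0] = 4.0  # the outlier stretches the scale of whatever group holds it

global_err = mean_abs_error(weights, quantize_dequantize(weights))
block_err = mean_abs_error(
    weights, blockwise_quantize_dequantize(weights, block_size=64)
)

print(f"single-scale mean abs error: {global_err:.6f}")
print(f"block-based  mean abs error: {block_err:.6f}")
```

With a single scale, the outlier forces a coarse quantization step on every weight. With per-block scales, only the block that contains the outlier pays that cost, which is the intuition behind the accuracy gain.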
These communication optimizations result in a significant reduction in communication volume. ZeRO++ achieves up to a 4x reduction compared to ZeRO, improving training throughput and efficiency. ZeRO++ offers 28% to 36% throughput improvement over ZeRO-3 in high-bandwidth clusters when using small batch sizes per GPU. ZeRO++ achieves an average of 2x speedup in low-bandwidth clusters compared to ZeRO-3, making large model training more accessible across a wider variety of clusters.
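The 4x figure can be reproduced with a short back-of-the-envelope calculation. The sketch below follows the accounting described in the ZeRO++ paper as we understand it, under the assumption of an FP16 baseline: ZeRO-3 moves roughly one model-sized volume M across nodes for each of the forward all-gather, the backward all-gather, and the gradient reduce-scatter, while ZeRO++ sends INT8 weights (qwZ), keeps the backward all-gather inside the node (hpZ), and sends INT4 gradients (qgZ). The dictionary labels are illustrative only.

```python
# Cross-node communication volume per training step, in units of M,
# where M is the size of one FP16 copy of the model parameters.
M = 1.0

zero3_volume = {
    "forward all-gather of weights": M,
    "backward all-gather of weights": M,
    "gradient reduce-scatter": M,
}

zeropp_volume = {
    # qwZ: weights sent as INT8 instead of FP16 (8 bits instead of 16).
    "forward all-gather of weights (qwZ)": M * 8 / 16,
    # hpZ: a full copy lives inside each node, so no cross-node traffic here.
    "backward all-gather of weights (hpZ)": 0.0,
    # qgZ: gradients sent as INT4 instead of FP16 (4 bits instead of 16).
    "gradient communication (qgZ)": M * 4 / 16,
}

zero3_total = sum(zero3_volume.values())
zeropp_total = sum(zeropp_volume.values())
reduction = zero3_total / zeropp_total

print(f"ZeRO-3: {zero3_total}M, ZeRO++: {zeropp_total}M, reduction: {reduction}x")
```

This counts only cross-node traffic. hpZ still performs an all-gather in the backward pass, but within a single machine, where bandwidth is typically much higher.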
ZeRO++ is not limited to standard pretraining and fine-tuning; it also applies to reinforcement learning from human feedback (RLHF), the training method used for dialogue models, which alternates between a generation phase and a training phase. By integrating ZeRO++ with DeepSpeed-Chat, RLHF training benefits in both phases, achieving up to 2.25x better generation throughput and 1.26x better training throughput than ZeRO.
DeepSpeed has released ZeRO++ to make large model training more efficient and accessible to the AI community. The system is designed to accelerate training, reduce communication overhead, and enable larger batch sizes, ultimately saving time and resources. Researchers and practitioners can leverage ZeRO++ to train models like ChatGPT more effectively and explore new possibilities in AI.
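For readers who want to try it, the three optimizations are switched on through the DeepSpeed configuration. The snippet below is a hedged sketch based on our reading of the DeepSpeed ZeRO++ tutorial: the three `zero_*` keys are the ones we believe the tutorial uses, while the batch size, the bf16 setting, and the value of 8 GPUs per node are placeholder assumptions. Check the current DeepSpeed documentation before relying on it.

```python
import json

# Assumption: hpZ's partition size is set to the number of GPUs in one node,
# so that each node holds a full copy of the weights. 8 is a placeholder.
GPUS_PER_NODE = 8

ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # placeholder value
    "bf16": {"enabled": True},            # placeholder value
    "zero_optimization": {
        "stage": 3,                                # ZeRO++ builds on ZeRO-3
        "zero_quantized_weights": True,            # qwZ
        "zero_hpz_partition_size": GPUS_PER_NODE,  # hpZ
        "zero_quantized_gradients": True,          # qgZ
    },
}

# Print as JSON, the format DeepSpeed config files are usually written in.
print(json.dumps(ds_config, indent=2))
```

A dictionary like this is typically passed to `deepspeed.initialize` through its `config` argument, or saved as a JSON file and referenced from the launch command.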
Check out the blog article and paper for more information.