Microsoft's ZeRO-Infinity Library Trains 32 Trillion Parameter AI Model
Microsoft recently announced ZeRO-Infinity, an addition to its open-source DeepSpeed AI training library that optimizes memory use when training very large deep-learning models. Using ZeRO-Infinity, Microsoft trained a model with 32 trillion parameters on a cluster of 32 NVIDIA DGX-2 nodes (512 GPUs), and demonstrated fine-tuning of a 1 trillion parameter model on a single GPU. The DeepSpeed team described the new features in a recent blog post. ZeRO-Infinity is the latest iteration of the Zero Redundancy Optimizer (ZeRO) family of memory optimization techniques. It introduces several new strategies for addressing memory and bandwidth constraints when training large deep-learning models: a new offload engine that exploits CPU and Non-Volatile Memory Express (NVMe) memory, memory-centric tiling to handle large operators without model parallelism, bandwidth-centric partitioning to reduce bandwidth costs, and an overlap-centric design for scheduling data communication.
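As a rough illustration of how the offload engine described above is typically switched on, the sketch below builds a DeepSpeed-style configuration that enables ZeRO stage 3 and offloads both parameters and optimizer state to NVMe. The helper name `build_zero_infinity_config`, the `/local_nvme` path, and the batch size are assumptions made for this example, not values from the announcement. The configuration keys follow the DeepSpeed configuration documentation as best understood here and should be checked against the installed version.

```python
import json

# Hypothetical mount point for a local NVMe drive (an assumption for this sketch).
NVME_PATH = "/local_nvme"


def build_zero_infinity_config(nvme_path: str = NVME_PATH) -> dict:
    """Return a DeepSpeed-style config dict with ZeRO stage 3 and NVMe offload.

    This helper is illustrative only; it is not part of DeepSpeed.
    """
    return {
        # Illustrative value; tune for the actual model and hardware.
        "train_micro_batch_size_per_gpu": 1,
        "fp16": {"enabled": True},
        "zero_optimization": {
            # Stage 3 partitions parameters, gradients, and optimizer state.
            "stage": 3,
            # Move parameters off the GPU to NVMe storage.
            "offload_param": {
                "device": "nvme",
                "nvme_path": nvme_path,
                "pin_memory": True,
            },
            # Move optimizer state off the GPU to NVMe storage.
            "offload_optimizer": {
                "device": "nvme",
                "nvme_path": nvme_path,
                "pin_memory": True,
            },
            # Ask DeepSpeed to overlap communication with computation.
            "overlap_comm": True,
        },
    }


if __name__ == "__main__":
    # Print the config as JSON, e.g. to save it as a DeepSpeed config file.
    print(json.dumps(build_zero_infinity_config(), indent=2))
```

Setting `"device"` to `"cpu"` instead of `"nvme"` (and dropping `nvme_path`) would keep the offloaded state in host memory, which may be the simpler choice on machines without fast local NVMe storage.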
Jun-25-2021, 03:10:47 GMT