MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
Trevor Gale, Deepak Narayanan, Cliff Young, Matei Zaharia
–arXiv.org Artificial Intelligence
Our system is motivated by the limitations of current frameworks, which restrict the dynamic routing in MoE layers to satisfy the constraints of existing software and hardware. These formulations force a tradeoff between model quality and hardware efficiency: users must choose between dropping tokens from the computation or wasting computation and memory on padding. To address these limitations, we reformulate MoE computation in terms of block-sparse operations and develop new block-sparse GPU kernels that efficiently handle the dynamism present in MoEs. Our approach never drops tokens and maps efficiently to modern hardware, enabling end-to-end training speedups of up to 40% over MoEs trained with the state-of-the-art Tutel library and 2.4x over DNNs trained with the highly-optimized Megatron-LM framework.

From the introduction: The past decade has seen significant progress in algorithms and high-performance software to make sparsity practically useful (Gray et al., 2017; Narang et al., 2017; Kalchbrenner et al., 2018; Elsen et al., 2020; Gale et al., 2020). [...] are fundamental to these architectures. However, existing hardware and software for deep learning make it difficult to meet this challenge. For example, TPUs and their XLA compiler require all tensor shapes to be known statically [...]
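The tradeoff the abstract describes, dropping tokens versus padding under a fixed expert capacity, as opposed to routing every token with no padding, can be made concrete with a small sketch. The following is a minimal, hypothetical PyTorch illustration and not the MegaBlocks implementation (which uses block-sparse GPU kernels); the function names `capacity_factor_moe` and `dropless_moe` and all shapes are assumptions made for illustration.

```python
# Minimal, hypothetical sketch of the two MoE formulations described above,
# written in plain PyTorch. It is NOT the MegaBlocks implementation; names and
# shapes are illustrative assumptions.
import torch

num_tokens, d_model, d_ff, num_experts = 8, 4, 16, 2
x = torch.randn(num_tokens, d_model)                # token activations
w1 = torch.randn(num_experts, d_model, d_ff)        # per-expert weight matrices
expert = torch.randint(num_experts, (num_tokens,))  # top-1 routing assignments


def capacity_factor_moe(x, w1, expert, capacity):
    """Fixed-capacity formulation: each expert handles exactly `capacity` tokens,
    so overflowing tokens are dropped and underfull experts are zero-padded."""
    out = torch.zeros(x.shape[0], w1.shape[-1])
    for e in range(w1.shape[0]):
        idx = (expert == e).nonzero(as_tuple=True)[0]
        kept = idx[:capacity]                        # tokens beyond capacity are dropped
        buf = torch.zeros(capacity, x.shape[1])      # zero padding wastes compute and memory
        buf[:kept.numel()] = x[kept]
        y = buf @ w1[e]                              # dense (capacity, d_ff) matmul
        out[kept] = y[:kept.numel()]
    return out


def dropless_moe(x, w1, expert):
    """Variable-size formulation: every token reaches its expert and nothing is
    padded; each per-expert matmul is sized exactly to its group of tokens."""
    out = torch.empty(x.shape[0], w1.shape[-1])
    order = torch.argsort(expert)                    # group token indices by expert
    counts = torch.bincount(expert, minlength=w1.shape[0])
    start = 0
    for e, n in enumerate(counts.tolist()):
        idx = order[start:start + n]
        out[idx] = x[idx] @ w1[e]                    # matmul sized to this expert's group
        start += n
    return out


print(capacity_factor_moe(x, w1, expert, capacity=num_tokens // num_experts).shape)
print(dropless_moe(x, w1, expert).shape)
```

In the paper, the variable-sized per-expert products are expressed as a single block-sparse matrix multiplication on the GPU rather than a Python loop; the sketch above only emulates the resulting "dropless" semantics.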
Nov-28-2022
- Country:
  - Europe
  - North America
    - Canada
    - United States
      - Arizona > Maricopa County
        - Phoenix (0.04)
      - California
      - Georgia > Fulton County
        - Atlanta (0.04)
      - Hawaii > Honolulu County
        - Honolulu (0.04)
      - Missouri > St. Louis County
        - St. Louis (0.04)
      - New York > New York County
        - New York City (0.04)
      - Washington > King County
        - Redmond (0.04)
- Genre:
  - Research Report (0.50)
- Industry:
  - Information Technology (0.30)
- Technology: