A Review of Sparse Expert Models in Deep Learning
William Fedus, Jeff Dean, Barret Zoph
arXiv.org Artificial Intelligence
Sparse expert models are a thirty-year-old concept re-emerging as a popular architecture in deep learning. This class of architecture encompasses Mixture-of-Experts, Switch Transformers, Routing Networks, BASE layers, and others, all with the unifying idea that each example is acted on by only a subset of the parameters. The degree of sparsity thereby decouples the parameter count from the compute per example, allowing for extremely large but efficient models. The resulting models have demonstrated significant improvements across diverse domains such as natural language processing, computer vision, and speech recognition. We review the concept of sparse expert models, provide a basic description of the common algorithms, contextualize the advances in the deep learning era, and conclude by highlighting areas for future work.

Remarkable advances in machine learning, especially in natural language, have been achieved by increasing the computational budget, training data, and model size. However, state-of-the-art models now require thousands of specialized, interconnected accelerators for weeks or months at a time. These models are therefore expensive to produce and incur high energy costs (Patterson et al., 2021). As the scale of machine learning systems has increased, the field has consequently sought more efficient training and serving paradigms, and sparse expert models have risen as a promising solution.

[Figure: A dense model (left) sends both input tokens to the same feed-forward network (FFN) parameters. Each model uses a similar amount of computation, but the sparse model has more unique parameters. While this figure showcases a specific and common approach of sparse feed-forward network layers in a Transformer (Vaswani et al., 2017), the technique is more general.]
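The decoupling of parameter count from per-example compute can be made concrete with back-of-envelope arithmetic. The sketch below uses illustrative layer sizes (not figures from the review) and assumes top-1 routing, i.e. each token visits exactly one expert:

```python
# Illustrative sizes only; a Transformer FFN has two weight matrices,
# d_model x d_ff and d_ff x d_model (biases ignored for simplicity).
d_model, d_ff, n_experts = 1024, 4096, 64

dense_ffn_params = 2 * d_model * d_ff            # one dense FFN
moe_total_params = n_experts * dense_ffn_params  # all experts' weights
moe_active_params = dense_ffn_params             # top-1: one expert per token

print(moe_total_params / dense_ffn_params)   # 64.0x the parameters...
print(moe_active_params / dense_ffn_params)  # ...at 1.0x the per-token compute
```

The routing layer adds a small overhead (a d_model x n_experts projection), but it is negligible next to the expert FFNs themselves.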
Sparse expert models, of which Mixture-of-Experts (MoE) is the most popular variant, are neural networks in which a set of the parameters is partitioned into "experts", each with unique weights. As a result, each example interacts with only a subset of the network parameters, in contrast to the usual approach where the entire network is used for every input. Because only a fraction of the experts are used for each example, the amount of computation may remain small relative to the total model size. Many modern sparse expert models draw inspiration from Shazeer et al. (2017), which trained what was then the largest model and achieved state-of-the-art language modeling and translation results.
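A minimal sketch of the core mechanism, a learned router that sends each token to its top-k experts and combines their outputs with renormalized gate weights. This is a generic illustration in NumPy, not the exact algorithm of any paper cited above; the router matrix, expert weights, and sizes are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 4, 2
tokens = rng.normal(size=(3, d_model))        # a batch of 3 input tokens

# Router: a learned linear layer producing one logit per expert.
W_router = rng.normal(size=(d_model, n_experts))
logits = tokens @ W_router                    # shape (3, n_experts)

# Softmax over experts, then keep each token's top-k experts and
# renormalize the selected gate values to sum to 1.
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
top_idx = np.argsort(probs, axis=-1)[:, -top_k:]      # chosen expert ids
gates = np.take_along_axis(probs, top_idx, axis=-1)
gates /= gates.sum(axis=-1, keepdims=True)

# Hypothetical experts: here each is a single d_model x d_model matrix
# (a real MoE layer would use full FFN experts).
experts = rng.normal(size=(n_experts, d_model, d_model))

# Output is the gate-weighted sum of the chosen experts' outputs;
# the remaining n_experts - top_k experts do no work for this token.
out = np.zeros_like(tokens)
for t in range(tokens.shape[0]):
    for slot in range(top_k):
        e = top_idx[t, slot]
        out[t] += gates[t, slot] * (tokens[t] @ experts[e])
```

In practice the per-token loop is replaced by batched dispatch/combine operations, and an auxiliary load-balancing loss is typically added so tokens spread evenly across experts.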
Sep-4-2022