dselect-k
Supplementaryto"DSelect-k: Differentiable SelectionintheMixtureofExpertswithApplications toMulti-TaskLearning "
MTL: InMTL, deep learning-based architectures that perform soft-parameter sharing, i.e., share model parameters partially, are proving to be effective at exploiting both the commonalities and differences among tasks [6]. Ourwork is also related to [5] who introduced "routers" (similar to gates) that can choose which layers or components of layers to activate per-task. The routers in the latter work are not differentiable and requirereinforcementlearning. To construct α, there are two cases to consider: (i)s = k and (ii) s < k. If s = k, then set αi = log(w ti) for i [k]. Our base case is fort = 1.
- Asia > Middle East > Jordan (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- Europe > France (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.96)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning
The Mixture-of-Experts (MoE) architecture is showing promising results in improving parameter sharing in multi-task learning (MTL) and in scaling high-capacity neural networks. State-of-the-art MoE models use a trainable sparse gate' to select a subset of the experts for each input example. While conceptually appealing, existing sparse gates, such as Top-k, are not smooth. The lack of smoothness can lead to convergence and statistical performance issues when training with gradient-based methods. In this paper, we develop DSelect-k: a continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation.
f5ac21cd0ef1b88e9848571aeb53551a-Supplemental.pdf
Supplementary to "DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-T ask Learning" In MTL, deep learning-based architectures that perform soft-parameter sharing, i.e., share model parameters partially, are proving to be effective at exploiting both the commonalities and differences among tasks [ This approach is similar to static gating, but it does not support per-example gating. Moreover, the number of nonzeros cannot be directly controlled (in contrast to our gate). Next, we show Direction (II). From the definition of r ( .), the following holds: r (S (v)) The penalty described above is part of our TensorFlow implementation of DSelect-k. Note that the logistic function is re-scaled to be on the same scale as the smooth-step function.Figure B.1: The Smooth-step ( γ = 1) and Logistic functions.
- Asia > Middle East > Jordan (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- Europe > France (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.96)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning
The Mixture-of-Experts (MoE) architecture is showing promising results in improving parameter sharing in multi-task learning (MTL) and in scaling high-capacity neural networks. State-of-the-art MoE models use a trainable "sparse gate'" to select a subset of the experts for each input example. While conceptually appealing, existing sparse gates, such as Top-k, are not smooth. The lack of smoothness can lead to convergence and statistical performance issues when training with gradient-based methods. In this paper, we develop DSelect-k: a continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation.
DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning
Hazimeh, Hussein, Zhao, Zhe, Chowdhery, Aakanksha, Sathiamoorthy, Maheswaran, Chen, Yihua, Mazumder, Rahul, Hong, Lichan, Chi, Ed H.
The Mixture-of-experts (MoE) architecture is showing promising results in multi-task learning (MTL) and in scaling high-capacity neural networks. State-of-the-art MoE models use a trainable sparse gate to select a subset of the experts for each input example. While conceptually appealing, existing sparse gates, such as Top-k, are not smooth. The lack of smoothness can lead to convergence and statistical performance issues when training with gradient-based methods. In this paper, we develop DSelect-k: the first, continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation. Our gate can be trained using first-order methods, such as stochastic gradient descent, and offers explicit control over the number of experts to select. We demonstrate the effectiveness of DSelect-k in the context of MTL, on both synthetic and real datasets with up to 128 tasks. Our experiments indicate that MoE models based on DSelect-k can achieve statistically significant improvements in predictive and expert selection performance. Notably, on a real-world large-scale recommender system, DSelect-k achieves over 22% average improvement in predictive performance compared to the Top-k gate. We provide an open-source TensorFlow implementation of our gate.
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.88)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)