DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning

Jan-19-2025, 14:06:14 GMT–Neural Information Processing Systems

The Mixture-of-Experts (MoE) architecture is showing promising results in improving parameter sharing in multi-task learning (MTL) and in scaling high-capacity neural networks. State-of-the-art MoE models use a trainable "sparse gate" to select a subset of the experts for each input example. While conceptually appealing, existing sparse gates, such as Top-k, are not smooth. This lack of smoothness can lead to convergence and statistical performance issues when training with gradient-based methods. In this paper, we develop DSelect-k: a continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation.
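To make the non-smoothness of Top-k concrete, here is a minimal sketch of a generic Top-k gate in plain Python. It is an illustration only, not the paper's code and not DSelect-k itself: the function names `softmax` and `top_k_gate`, the softmax-over-selected-logits weighting, and the example logits are all assumptions chosen for the demonstration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(logits, k):
    """Hypothetical Top-k gate: softmax over the k largest logits, zero elsewhere."""
    # Indices of the k largest logits (ties broken in favor of the lower index).
    top = sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]
    weights = softmax([logits[i] for i in top])
    gate = [0.0] * len(logits)
    for i, w in zip(top, weights):
        gate[i] = w
    return gate

# Two almost identical logit vectors: experts 1 and 2 swap rank by a tiny margin.
eps = 1e-6
before = top_k_gate([2.0, 1.0 + eps, 1.0, 0.0], k=2)
after = top_k_gate([2.0, 1.0, 1.0 + eps, 0.0], k=2)

# Largest change in any gate weight caused by a perturbation of size eps.
jump = max(abs(a - b) for a, b in zip(before, after))
print(before, after, jump)
```

Under these assumptions, a perturbation of order 1e-6 in the logits moves a gate weight of roughly 0.27 from one expert to another. That jump is the kind of discontinuity the abstract refers to, and it is what a continuously differentiable gate is meant to avoid.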

differentiable selection, dselect-k, multi-task learning
