On the Representation Collapse of Sparse Mixture of Experts

Jan-19-2025, 02:50:11 GMT–Neural Information Processing Systems

Sparse mixture of experts provides larger model capacity while keeping the computational overhead constant. It employs a routing mechanism to distribute input tokens to the best-matched experts according to their hidden representations. However, learning such a routing mechanism encourages tokens to cluster around expert centroids, implying a trend toward representation collapse. In this work, we propose to estimate the routing scores between tokens and experts on a low-dimensional hypersphere. We conduct extensive experiments on cross-lingual language model pre-training and on fine-tuning for downstream tasks.
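
The idea of scoring tokens against experts on a low-dimensional hypersphere can be illustrated with a minimal sketch. The abstract does not give the details, so everything below is an assumption made for illustration: the linear projection `W`, the expert embeddings `E`, the fixed `temperature`, top-1 selection, and all function names are hypothetical, and the weights are random rather than learned. The sketch projects a hidden state to a small routing dimension, L2-normalizes both the projected token and each expert embedding so they lie on the unit sphere, and uses their cosine similarity (divided by a temperature) as the routing score.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length so it lies on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # guard against a zero vector
    return [x / norm for x in vec]


def project(hidden, weight):
    """Linear map from the model dimension down to the routing dimension."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def routing_scores(hidden, weight, expert_embeddings, temperature=0.1):
    """Cosine similarity between the projected token and each expert, over a temperature."""
    token = l2_normalize(project(hidden, weight))
    scores = []
    for expert in expert_embeddings:
        unit_expert = l2_normalize(expert)
        cosine = sum(t * e for t, e in zip(token, unit_expert))
        scores.append(cosine / temperature)
    return scores


def route_top1(hidden, weight, expert_embeddings, temperature=0.1):
    """Return (expert index, softmax gate value) for the best-matching expert."""
    scores = routing_scores(hidden, weight, expert_embeddings, temperature)
    peak = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, exps[best] / sum(exps)


# Hypothetical sizes and random (untrained) parameters, for illustration only.
rng = random.Random(0)
d_model, d_route, n_experts = 16, 4, 3
W = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_route)]
E = [[rng.gauss(0, 1) for _ in range(d_route)] for _ in range(n_experts)]
h = [rng.gauss(0, 1) for _ in range(d_model)]

expert, gate = route_top1(h, W, E)
print(f"token routed to expert {expert} with gate {gate:.3f}")
```

Because both sides are normalized, the score depends only on the direction of the projected token and not on its norm, and every score is bounded by `1 / temperature` in absolute value.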

huang, representation collapse, sparse mixture

Technology:
- Information Technology > Artificial Intelligence > Machine Learning (0.45)