Mixture of Tokens: Continuous MoE through Cross-Example Aggregation

Neural Information Processing Systems 

Most widely adopted MoE models are discontinuous with respect to their parameters, and are therefore often referred to as sparse.
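To illustrate the discontinuity, here is a minimal toy sketch (not the paper's implementation) of hard top-1 routing: an infinitesimal change in the gating weights can flip which expert is selected, so the layer's output jumps rather than varying smoothly. The names `top1_moe`, `w_gate`, and the expert functions are illustrative assumptions.

```python
import numpy as np

def top1_moe(x, w_gate, experts):
    """Toy sparse (top-1) MoE: only the highest-scoring expert runs.
    The argmax makes the output a discontinuous function of w_gate."""
    scores = w_gate @ x
    k = int(np.argmax(scores))  # hard, non-differentiable routing decision
    return experts[k](x)

x = np.array([1.0, 1.0])
experts = [lambda v: v.sum(), lambda v: -v.sum()]

# A tiny perturbation of the gating weights flips the routing decision,
# so the output jumps from +2 to -2 instead of changing gradually.
out_a = top1_moe(x, np.array([[0.51, 0.0], [0.49, 0.0]]), experts)  # expert 0
out_b = top1_moe(x, np.array([[0.49, 0.0], [0.51, 0.0]]), experts)  # expert 1
```

This discontinuity is what distinguishes sparse MoE from a continuous mixture, where every expert contributes with a smoothly varying weight.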