MoxE: Mixture of xLSTM Experts with Entropy-Aware Routing for Efficient Language Modeling
Thiombiano, Abdoul Majid O., Hnich, Brahim, Mrad, Ali Ben, Mkaouer, Mohamed Wiem
arXiv.org Artificial Intelligence
However, the quadratic complexity O(n²) of the attention mechanism (where n is the sequence length) makes it computationally expensive to train and deploy large models, particularly for long sequences. This inherent limitation poses significant challenges for scalability and efficiency in real-world applications. One highly effective technique widely adopted to mitigate these challenges in training and deploying such massive models is the Mixture of Experts (MoE) framework [5, 11]. In a MoE architecture, at inference time the model activates only a sparse subset of its total parameters to process each input, dramatically reducing runtime computation and enabling more efficient scaling. The sparse MoE approach has been successfully applied to various models, demonstrating significant improvements in efficiency while maintaining or even enhancing performance [2]. Traditional Long Short-Term Memory (LSTM) networks, while demonstrably powerful in sequence modeling, struggle to manage long-term dependencies and to perform efficient associative recall, particularly over extended sequences. The Extended Long Short-Term Memory (xLSTM) architecture [1] directly addresses these fundamental limitations by introducing novel memory structures and optimized computation approaches within the LSTM unit itself.
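To make the sparse-activation idea concrete, the sketch below shows a generic top-k MoE layer in PyTorch: a learned router scores each token, only the k highest-scoring experts are evaluated, and the router's entropy is computed as the kind of uncertainty signal an entropy-aware scheme could exploit. This is an illustrative sketch only; the class and parameter names (SimpleExpert, SparseMoELayer, num_experts, top_k) are hypothetical, the experts here are plain feed-forward blocks rather than the xLSTM experts used in MoxE, and the entropy term is not the paper's actual routing objective.

```python
# Minimal sketch of sparse top-k MoE routing (illustrative; not the MoxE implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleExpert(nn.Module):
    """Small feed-forward expert; MoxE would use xLSTM-based experts instead."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):
        return self.net(x)


class SparseMoELayer(nn.Module):
    """Routes each token to its top-k experts, so only a sparse subset of
    parameters is active per token at inference time."""
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            SimpleExpert(d_model, d_hidden) for _ in range(num_experts)
        )
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):
        # x: (batch, seq_len, d_model); flatten tokens for per-token routing.
        batch, seq_len, d_model = x.shape
        tokens = x.reshape(-1, d_model)

        logits = self.router(tokens)                      # (num_tokens, num_experts)
        probs = F.softmax(logits, dim=-1)

        # Router entropy: high entropy means the router is uncertain which expert
        # to use. An entropy-aware scheme could use this as an auxiliary loss or
        # routing signal (hypothetical here; see the paper for MoxE's mechanism).
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1).mean()

        # Keep only the top-k experts per token and renormalize their weights.
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):
            idx = topk_idx[:, slot]
            weight = topk_probs[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    # Only the selected experts run on their assigned tokens.
                    out[mask] += weight[mask] * expert(tokens[mask])

        return out.reshape(batch, seq_len, d_model), entropy
```

Because each token touches only top_k of num_experts experts, compute per token stays roughly constant as the total parameter count grows, which is the efficiency property the abstract refers to.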
May-6-2025