OTCE: Hybrid SSM and Attention with Cross Domain Mixture of Experts to construct Observer-Thinker-Conceiver-Expresser
Shi, Jingze, Xie, Ting, Wu, Bingheng, Zheng, Chunjun, Wang, Kai
arXiv.org Artificial Intelligence
The Transformer architecture (Attention Is All You Need; Vaswani et al. 2017) is popular in modern deep-learning language modeling. It can directly capture the relationship between any two elements in a sequence and so handles long-distance dependencies effectively. However, the architecture has two main drawbacks. First, when processing long sequences, the quadratic complexity of self-attention and the size of its cache, which grows with sequence length, limit the ability to handle long contexts. Second, the Transformer lacks a single summary state, which means that each generated token must be computed over the entire context. Meanwhile, the selective state space model Mamba (Gu and Dao 2023) has emerged. Through its selective state update mechanism, Mamba scales linearly with sequence length during training and maintains a constant state size during generation.
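The scaling contrast described above can be made concrete with a rough count. The sketch below is illustrative only and is not taken from the paper: the function names and the sizes `d_model` and `d_state` are hypothetical, and the counts ignore heads, layers, batch size, and numeric precision.

```python
def attention_score_count(seq_len: int) -> int:
    """Pairwise scores computed by causal self-attention over a whole sequence."""
    # Token i attends to tokens 1..i, so the total is 1 + 2 + ... + seq_len.
    return seq_len * (seq_len + 1) // 2


def kv_cache_entries(seq_len: int, d_model: int) -> int:
    """Numbers held in one layer's key/value cache after seq_len tokens."""
    # One key vector and one value vector are kept for every past token.
    return 2 * seq_len * d_model


def ssm_state_entries(d_model: int, d_state: int) -> int:
    """Numbers held in a fixed-size recurrent state, independent of seq_len."""
    return d_model * d_state


# Hypothetical sizes, chosen only to show how each quantity scales.
for n in (1_000, 2_000, 4_000):
    print(n, attention_score_count(n), kv_cache_entries(n, 64), ssm_state_entries(64, 16))
```

Doubling the sequence length roughly quadruples the score count and doubles the cache, while the recurrent state stays the same size.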
Jun-24-2024
- Country:
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > China > Liaoning Province > Dalian (0.04)
- Genre:
- Research Report (0.50)
- Industry:
- Education > Curriculum > Subject-Specific Education (1.00)
- Technology: