Improving Transformer with an Admixture of Attention Heads
Tan M. Nguyen
Neural Information Processing Systems
At the core of FiSHformer is a novel finite admixture model of shared heads (FiSH) that samples attention matrices from a set of global attention matrices. The number of global attention matrices is much smaller than the number of local attention matrices generated. FiSHformers directly learn these global attention matrices rather than the local ones as in other transformers, thus significantly improving the computational and memory efficiency of the model.
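As a rough illustration of the idea above, the following sketch builds a small set of global attention matrices and forms a larger set of local (per-head) attention matrices from them. It is a simplification under stated assumptions, not the paper's implementation: the abstract describes sampling from the admixture, whereas this sketch uses a deterministic convex combination with softmax-normalized mixing weights. All names (`global_attention`, `mix_heads`, `mix_logits`) and the sizes used are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def global_attention(X, Wq, Wk):
    """One global attention matrix: softmax(Q K^T / sqrt(d_k)), row-wise."""
    Q, K = matmul(X, Wq), matmul(X, Wk)
    scale = math.sqrt(len(Wq[0]))
    scores = matmul(Q, transpose(K))
    return [softmax([s / scale for s in row]) for row in scores]


def mix_heads(global_attn, mix_logits):
    """Form each local attention matrix as a convex combination of the
    global ones (assumption: deterministic mixing instead of sampling)."""
    n = len(global_attn[0])
    local = []
    for logits in mix_logits:
        w = softmax(logits)  # mixing weights over the global matrices
        head = [
            [sum(w[g] * global_attn[g][i][j] for g in range(len(w))) for j in range(n)]
            for i in range(n)
        ]
        local.append(head)
    return local


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: 5 tokens, model width 6, key width 4,
# 2 global attention matrices shared by 8 local heads.
rng = random.Random(0)
n_tokens, d_model, d_k, n_global, n_local = 5, 6, 4, 2, 8

X = rand_matrix(n_tokens, d_model, rng)
global_attn = [
    global_attention(X, rand_matrix(d_model, d_k, rng), rand_matrix(d_model, d_k, rng))
    for _ in range(n_global)
]
mix_logits = rand_matrix(n_local, n_global, rng)
local_attn = mix_heads(global_attn, mix_logits)

print(len(global_attn), "global ->", len(local_attn), "local attention matrices")
```

In this sketch, query and key projections exist only for the 2 global matrices, and each of the 8 local heads adds just 2 mixing parameters. That is one way to read the efficiency claim above. Because each local matrix is a convex combination of row-stochastic matrices, its rows still sum to 1.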