Improving Transformer with an Admixture of Attention Heads

Tan M. Nguyen

Neural Information Processing Systems 

At the core of FiSHformer is a novel finite admixture model of shared heads (FiSH) that samples attention matrices from a set of global attention matrices. The number of global attention matrices is much smaller than the number of local attention matrices generated. FiSHformers directly learn these global attention matrices rather than the local ones as in other transformers, thus significantly improving the computational and memory efficiency of the model.
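The relationship between the few global attention matrices and the many local ones can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not state: each local attention matrix is taken to be a convex combination of the global ones, whereas the abstract describes sampling; the global matrices and mixing weights are random placeholders rather than learned parameters; and the names `random_attention`, `mix_heads`, `global_attn`, and `weights` are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_attention(n, rng):
    """Row-stochastic n x n matrix standing in for a learned global attention matrix."""
    return [softmax([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(n)]


def mix_heads(global_attn, weights):
    """Build one local attention matrix per weight vector.

    Assumption: a local matrix is a convex combination of the global matrices.
    """
    num_global = len(global_attn)
    n = len(global_attn[0])
    local = []
    for w in weights:  # one mixing-weight vector per local head
        head = [
            [sum(w[g] * global_attn[g][i][j] for g in range(num_global))
             for j in range(n)]
            for i in range(n)
        ]
        local.append(head)
    return local


rng = random.Random(0)
seq_len, num_global, num_local = 4, 2, 8  # far fewer global than local matrices

# Placeholder global attention matrices (learned in a real model).
global_attn = [random_attention(seq_len, rng) for _ in range(num_global)]

# Placeholder mixing weights: one probability vector per local head.
weights = [
    softmax([rng.gauss(0.0, 1.0) for _ in range(num_global)])
    for _ in range(num_local)
]

local_attn = mix_heads(global_attn, weights)
```

In this sketch only the 2 global matrices and the 8 small weight vectors would need to be stored, and because a convex combination of row-stochastic matrices is itself row-stochastic, every local matrix remains a valid attention matrix.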
