Mixture of Attention Heads: Selecting Attention Heads Per Token
