More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations

Open in new window