Searching for Efficient Linear Layers over a Continuous Space of Structured Matrices

Neural Information Processing Systems 

In contrast to the standard sparse MoE, which replaces each entire feed-forward network with a mixture of experts, BTT-MoE learns an MoE in every single linear layer of the model, including the projection matrices in the attention blocks.
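To make the distinction concrete, the following is a minimal PyTorch sketch of a per-linear-layer mixture of experts in the spirit of BTT-MoE. The class and parameter names are hypothetical; the experts here are plain dense `nn.Linear` maps for brevity, whereas BTT-MoE uses BTT-structured matrices, and the generic top-k router below is an assumption rather than the paper's exact routing scheme.

```python
# Hypothetical sketch: replace a single linear layer (e.g. an attention
# projection) with a gated mixture of expert linear maps. In BTT-MoE the
# experts would be BTT-structured matrices; dense nn.Linear is used here
# only to keep the example self-contained.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        # One expert per linear map; structured (BTT) experts would go here.
        self.experts = nn.ModuleList(
            nn.Linear(d_in, d_out, bias=False) for _ in range(num_experts)
        )
        self.router = nn.Linear(d_in, num_experts, bias=False)  # per-token gating
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., d_in); route each token to its top-k experts.
        scores = self.router(x)                        # (..., num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)           # renormalize over chosen experts
        out = torch.zeros(*x.shape[:-1], self.experts[0].out_features,
                          device=x.device, dtype=x.dtype)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                # tokens sent to expert e in slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: a drop-in replacement for one projection matrix, e.g. a query map.
layer = MoELinear(d_in=512, d_out=512)
y = layer(torch.randn(4, 16, 512))    # (batch, seq, d_out)
```

The key design point the sentence above describes is where the mixture lives: instead of one router per FFN block, every weight matrix in the network carries its own small set of experts and its own router.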