Dense Backpropagation Improves Training for Sparse Mixture-of-Experts
