on Fine-tuning with a Dense Model

Neural Information Processing Systems 

Our 8B MoE model achieves stronger pre-training perplexity than its dense counterpart. However, better perplexity does not always translate directly into better downstream performance, as demonstrated in Section 4.4. We therefore compare the fine-tuning performance of the 8B dense model and the 8B MoE model in Table 1. As shown in the table, our MoE model using expert choice routing consistently outperforms the dense model across the 11 tasks in GLUE and SuperGLUE. We also evaluate downstream fine-tuning performance while varying the capacity factor.
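To make the role of the capacity factor concrete, the following is a minimal sketch of expert choice routing on a toy batch. It assumes the usual definition in which each expert selects its top-k tokens with k = n * c / e (n tokens, capacity factor c, e experts). The function names, the rounding rule, and the toy scores are illustrative assumptions, not the implementation used for the experiments.

```python
import math


def expert_capacity(num_tokens, capacity_factor, num_experts):
    """Tokens per expert, k = n * c / e.

    Rounding down and clamping to [1, num_tokens] are assumptions of this sketch.
    """
    k = math.floor(num_tokens * capacity_factor / num_experts)
    return max(1, min(num_tokens, k))


def expert_choice_route(scores, capacity_factor):
    """Let every expert pick its k highest-scoring tokens.

    scores[t][e] is the affinity of token t for expert e (hypothetical input,
    e.g. a softmax over experts). Returns one list of token indices per expert.
    """
    num_tokens = len(scores)
    num_experts = len(scores[0])
    k = expert_capacity(num_tokens, capacity_factor, num_experts)
    assignments = []
    for e in range(num_experts):
        # Rank tokens by their score for expert e, highest first.
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        assignments.append(ranked[:k])
    return assignments


# Toy batch: 4 tokens, 2 experts.
toy_scores = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.4, 0.6]]
routing_c1 = expert_choice_route(toy_scores, capacity_factor=1.0)
routing_c2 = expert_choice_route(toy_scores, capacity_factor=2.0)
```

With a capacity factor of 1, each expert in this toy batch takes two tokens; with a capacity factor of 2, each expert takes all four, so every token is processed by both experts. Varying the capacity factor therefore changes how many experts a token reaches on average.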
