on Fine tuning with a Dense Model

Apr-25-2026, 07:44:42 GMT–Neural Information Processing Systems

Our 8B MoE model achieves better pre-training perplexity than its dense counterpart. However, better perplexity does not always translate directly into better downstream performance, as demonstrated in Section 4.4. We therefore compare the fine-tuning performance of the 8B dense model and the MoE model in Table 1. As the table shows, our MoE model with expert choice routing consistently outperforms the dense model across the 11 tasks in GLUE and SuperGLUE. We further evaluate downstream fine-tuning performance under varying capacity factors.
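To make the role of the capacity factor concrete, the following is a minimal sketch of expert choice routing. It assumes the usual formulation in which each expert selects its top-k tokens by router score, with k = n * c / e for n tokens, capacity factor c, and e experts. The function names, the random logits, and the toy sizes are hypothetical and are not taken from the paper's implementation.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def expert_choice_route(logits, capacity_factor):
    """Let each expert pick its top-k tokens by router score.

    logits: n_tokens x n_experts router logits (hypothetical input).
    capacity_factor: assumed to mean the average number of experts per token.
    Returns (k, assignments), where assignments[e] is a list of
    (token_index, gate_weight) pairs chosen by expert e.
    """
    n_tokens, n_experts = len(logits), len(logits[0])
    # Per-expert bucket size k = n * c / e, clamped to [1, n_tokens].
    k = max(1, min(n_tokens, int(n_tokens * capacity_factor / n_experts)))
    # Token-to-expert affinity: softmax over experts for each token.
    scores = [softmax(row) for row in logits]
    assignments = []
    for e in range(n_experts):
        # Rank tokens by their affinity to expert e, highest first.
        ranked = sorted(range(n_tokens), key=lambda t: scores[t][e], reverse=True)
        assignments.append([(t, scores[t][e]) for t in ranked[:k]])
    return k, assignments


# Toy example with random router logits (illustrative values only).
rng = random.Random(0)
n_tokens, n_experts = 16, 4
logits = [[rng.gauss(0.0, 1.0) for _ in range(n_experts)] for _ in range(n_tokens)]
for c in (1, 2):
    k, assignments = expert_choice_route(logits, capacity_factor=c)
    print(f"capacity_factor={c}: each expert takes k={k} tokens")
```

In this sketch every expert processes exactly k tokens, so load is balanced by construction, while an individual token may be picked by several experts or by none. Raising the capacity factor increases k and therefore the compute spent per token on average.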

artificial intelligence, expert choice, machine learning


