Supplement for: TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training

Anonymous Author(s)

1 Appendix

1.1 Perplexity Evaluation Results
To further validate model convergence, we list the perplexity (PPL) at 100k steps (about 7 days of training) of GPT-Medium (12 layers, hidden size 1024, intermediate size 2048, GShard gate, capacity factor 1.2) with different numbers of experts on the OpenWebText2 dataset.
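For reference, the perplexity reported here is the exponential of the mean per-token cross-entropy loss. The following is a minimal sketch of that computation, not the evaluation code used for the experiments; the tensor names and shapes are assumptions.

```python
# Minimal sketch of the PPL computation (not the paper's evaluation code).
# Assumed shapes: `logits` is [batch, seq_len, vocab_size] and `targets`
# holds the integer token ids, shape [batch, seq_len].
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    # Mean cross-entropy (in nats) over all predicted tokens.
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="mean",
    )
    # Perplexity is the exponential of the average per-token loss.
    return torch.exp(loss).item()
```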
The dispatched data volume shows a "ladder-like" distribution trend: the ranks within a node have a high preference to dispatch the data to experts within the same node.

The detailed model configurations are listed in Table 2.

Table 2: Detailed model configuration.

Name                 Layers  Gate    Stage 1                         Stage 2  Stage 3  Stage 4  Capacity factor
Swin Transformer v1  12      GShard  concat 4x4, 96-d, LN, win. sz.
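Both configurations use a GShard-style gate with a fixed capacity factor (1.2 for GPT-Medium). As a rough illustration, the sketch below shows how such a gate typically converts the capacity factor into a per-expert token budget; the function name and the dropping policy described in the comment are assumptions, not taken from the paper.

```python
# Illustrative sketch only: how a GShard-style gate turns a capacity factor
# into a per-expert token budget. The function name and the exact dropping
# policy are assumptions, not taken from the paper.
import math

def expert_capacity(num_tokens: int, num_experts: int,
                    capacity_factor: float = 1.2) -> int:
    # Each expert accepts at most capacity_factor times its even share of
    # the batch; tokens routed beyond this budget are typically dropped or
    # sent to a lower-priority expert, depending on the gate.
    return math.ceil(capacity_factor * num_tokens / num_experts)

# Example: 8192 tokens routed over 16 experts with capacity factor 1.2
# gives a budget of ceil(614.4) = 615 tokens per expert.
print(expert_capacity(8192, 16))
```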
Figure 2: Speedup of TA-MoE over FastMoE on the Swin Transformer based model.

2 References

[1] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.