Supplement for: TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training

Anonymous Author(s)

1 Appendix

1.1 Perplexity Evaluation Results
To further validate model convergence, we list the perplexity (PPL) at 100k steps (about 7 days of training) of GPT-Medium (12 layers, hidden size 1024, intermediate size 2048, GShard gate, capacity factor 1.2) with different numbers of experts on the OpenWebText2 dataset.
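For reference, the perplexity reported here is the exponential of the mean per-token cross-entropy loss. The following is a minimal sketch of that computation, not the evaluation code used for the experiments; the tensor names and shapes are assumptions.

```python
# Minimal sketch of the PPL computation (not the paper's evaluation code).
# Assumed shapes: `logits` is [batch, seq_len, vocab_size] and `targets`
# holds the integer token ids, shape [batch, seq_len].
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    # Mean cross-entropy (in nats) over all predicted tokens.
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="mean",
    )
    # Perplexity is the exponential of the average per-token loss.
    return torch.exp(loss).item()
```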
The dispatched data volume shows a "ladder-like" distribution trend: the ranks within a node have a high preference to dispatch the data to experts within the same node.

The detailed model configurations are listed in Table 2.

Table 2: Detailed model configuration.

Name                 Layers  Gate    Stage 1                         Stage 2  Stage 3  Stage 4  Capacity factor
Swin Transformer v1  12      GShard  concat 4x4, 96-d, LN, win. sz.
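Both configurations use a GShard-style gate with a fixed capacity factor (1.2 for GPT-Medium). As a rough illustration, the sketch below shows how such a gate typically converts the capacity factor into a per-expert token budget; the function name and the dropping policy described in the comment are assumptions, not taken from the paper.

```python
# Illustrative sketch only: how a GShard-style gate turns a capacity factor
# into a per-expert token budget. The function name and the exact dropping
# policy are assumptions, not taken from the paper.
import math

def expert_capacity(num_tokens: int, num_experts: int,
                    capacity_factor: float = 1.2) -> int:
    # Each expert accepts at most capacity_factor times its even share of
    # the batch; tokens routed beyond this budget are typically dropped or
    # sent to a lower-priority expert, depending on the gate.
    return math.ceil(capacity_factor * num_tokens / num_experts)

# Example: 8192 tokens routed over 16 experts with capacity factor 1.2
# gives a budget of ceil(614.4) = 615 tokens per expert.
print(expert_capacity(8192, 16))
```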
Figure 2: Speedup of TA-MoE over FastMoE on the Swin Transformer based model.

2 References

[1] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.