Supplement for: TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training
Anonymous Author(s)
1 Appendix
1.1 Perplexity Evaluation Results
To further validate model convergence, we list the perplexity (PPL) at 100k steps (roughly 7 days of training) of GPT-Medium (12 layers, hidden size 1024, intermediate size 2048, GShard gate, capacity factor 1.2) with different expert numbers on the openwebtext2 dataset; a sketch of the perplexity computation is given after the references below. The dispatch statistics exhibit a "ladder-like" distribution trend, in which the ranks within a node have a high preference to dispatch their data to experts on the same node. The detailed model configurations are listed in Table 2. [Table 2: model configurations for the Swin Transformer v1 based model: 12 layers, GShard gate, per-stage settings (stage 1: concat 4x4, 96-d, LN, window size), capacity factor.]
Figure 2: Speedup of TA-MoE over FastMoE on the Swin Transformer based model.
2 References
Liu et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. ICCV 2021.
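Since the supplement reports PPL without defining it, the following is a minimal sketch of the metric as the exponential of the mean token-level cross-entropy. The `model` and `dataloader` interfaces are illustrative assumptions, not part of the TA-MoE or FastMoE codebases.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def evaluate_perplexity(model, dataloader, device="cuda"):
    """Perplexity = exp(mean token-level cross-entropy) over a validation set.
    Assumes `model` maps input_ids -> logits of shape [batch, seq, vocab] and
    `dataloader` yields (input_ids, labels) pairs with -100 marking positions
    to ignore."""
    model.eval()
    total_nll, total_tokens = 0.0, 0
    for input_ids, labels in dataloader:
        input_ids, labels = input_ids.to(device), labels.to(device)
        logits = model(input_ids)
        # Sum the negative log-likelihood over all non-ignored tokens.
        nll = F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            labels.view(-1),
            reduction="sum",
            ignore_index=-100,
        )
        total_nll += nll.item()
        total_tokens += (labels != -100).sum().item()
    return math.exp(total_nll / total_tokens)
```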
TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training
Chang Chen, Min Li, Zhihua Wu, Dianhai Yu, Chao Yang
Sparsely gated Mixture-of-Expert (MoE) models have demonstrated their effectiveness in scaling deep neural networks to an extreme scale. Although numerous efforts have been made to improve MoE performance from the model-design or system-optimization perspective, existing MoE dispatch patterns still cannot fully exploit the underlying heterogeneous network environments. In this paper, we propose TA-MoE, a topology-aware routing strategy for large-scale MoE training, developed from a model-system co-design perspective, which can dynamically adjust the MoE dispatch pattern according to the network topology. Based on communication modeling, we abstract the dispatch problem into an optimization objective and obtain the approximate dispatch pattern under different topologies. On top of that, we design a topology-aware auxiliary loss, which can adaptively route the data to fit the underlying topology without sacrificing model accuracy. Experiments show that TA-MoE can substantially outperform its counterparts on various hardware and model configurations, with speedups of roughly 1.01x-1.61x.
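As a rough illustration of the idea of a topology-aware auxiliary loss, the sketch below adds a hypothetical term that penalizes the share of gate probability sent to experts outside the caller's own node. The exact objective TA-MoE derives from its communication model is not reproduced here; the names `topology_aware_aux_loss`, `expert_node`, and `alpha` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def topology_aware_aux_loss(gate_logits, expert_node, local_node, alpha=0.01):
    """Hypothetical auxiliary term biasing the gate toward intra-node experts.

    gate_logits: [tokens, experts] raw gate scores on the current rank.
    expert_node: [experts] int tensor, node hosting each expert.
    local_node:  int, node id of the current rank.
    alpha:       weight of the auxiliary term.
    """
    probs = F.softmax(gate_logits, dim=-1)              # [tokens, experts]
    intra_mask = (expert_node == local_node).float()    # [experts]
    # Average probability mass each token keeps on its own node.
    intra_fraction = (probs * intra_mask).sum(dim=-1).mean()
    # Penalize the fraction dispatched off-node (cross-node traffic).
    return alpha * (1.0 - intra_fraction)
```

In practice such a term would be added alongside the load-balancing loss already used by GShard-style gates, so that locality is encouraged without collapsing all traffic onto local experts.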
GPT-3 Scared You? Meet Wu Dao 2.0: A Monster of 1.75 Trillion Parameters
Jack Clark, OpenAI's policy director, calls this trend of copying GPT-3 "model diffusion." Yet, among all the copies, Wu Dao 2.0 holds the record of being the largest of all, with a striking 1.75 trillion parameters (10x GPT-3). Coco Feng reported for the South China Morning Post that Wu Dao 2.0 was trained on 4.9TB of high-quality text and image data, which makes GPT-3's training dataset (570GB) pale in comparison. Still, it's worth noting that OpenAI researchers curated 45TB of data to extract those clean 570GB. Wu Dao 2.0 can learn from text and images and tackle tasks that involve both types of data (something GPT-3 can't do).
Five Key Facts About Wu Dao 2.0: The Largest Transformer Model Ever Built - KDnuggets
I recently started a new newsletter focused on AI education, and it already has over 50,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) AI-focused newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. It seems that every other month we hit a new milestone in the race to build massively large transformer models. GPT-2 set new records with a 1.5-billion-parameter model, only to be surpassed by Microsoft's Turing-NLG with 17 billion parameters.