Supplement for: TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training
Anonymous Author(s)
1 Appendix
1.1 Perplexity Evaluation Results
To further validate model convergence, we list the perplexity (PPL) at 100k steps (roughly 7 days of training) of GPT-Medium (12 layers, hidden size 1024, intermediate size 2048, GShard gate, capacity factor 1.2) with different expert numbers on the openwebtext2 dataset; a sketch of the perplexity computation is given after the references below. The dispatch statistics exhibit a "ladder-like" distribution trend, in which the ranks within a node have a high preference to dispatch their data to experts on the same node. The detailed model configurations are listed in Table 2. [Table 2: model configurations for the Swin Transformer v1 based model: 12 layers, GShard gate, per-stage settings (stage 1: concat 4x4, 96-d, LN, window size), capacity factor.]
Figure 2: Speedup of TA-MoE over FastMoE on the Swin Transformer based model.
2 References
Liu et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. ICCV 2021.
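Since the supplement reports PPL without defining it, the following is a minimal sketch of the metric as the exponential of the mean token-level cross-entropy. The `model` and `dataloader` interfaces are illustrative assumptions, not part of the TA-MoE or FastMoE codebases.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def evaluate_perplexity(model, dataloader, device="cuda"):
    """Perplexity = exp(mean token-level cross-entropy) over a validation set.
    Assumes `model` maps input_ids -> logits of shape [batch, seq, vocab] and
    `dataloader` yields (input_ids, labels) pairs with -100 marking positions
    to ignore."""
    model.eval()
    total_nll, total_tokens = 0.0, 0
    for input_ids, labels in dataloader:
        input_ids, labels = input_ids.to(device), labels.to(device)
        logits = model(input_ids)
        # Sum the negative log-likelihood over all non-ignored tokens.
        nll = F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            labels.view(-1),
            reduction="sum",
            ignore_index=-100,
        )
        total_nll += nll.item()
        total_tokens += (labels != -100).sum().item()
    return math.exp(total_nll / total_tokens)
```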
TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training
Chang Chen, Min Li, Zhihua Wu, Dianhai Yu, Chao Yang
Sparsely gated Mixture-of-Expert (MoE) models have demonstrated their effectiveness in scaling deep neural networks to an extreme scale. Although numerous efforts have been made to improve MoE performance from the model-design or system-optimization perspective, existing MoE dispatch patterns still cannot fully exploit the underlying heterogeneous network environments. In this paper, we propose TA-MoE, a topology-aware routing strategy for large-scale MoE training, developed from a model-system co-design perspective, which can dynamically adjust the MoE dispatch pattern according to the network topology. Based on communication modeling, we abstract the dispatch problem into an optimization objective and obtain the approximate dispatch pattern under different topologies. On top of that, we design a topology-aware auxiliary loss, which can adaptively route the data to fit the underlying topology without sacrificing model accuracy. Experiments show that TA-MoE can substantially outperform its counterparts on various hardware and model configurations, with speedups of roughly 1.01x-1.61x.
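As a rough illustration of the idea of a topology-aware auxiliary loss, the sketch below adds a hypothetical term that penalizes the share of gate probability sent to experts outside the caller's own node. The exact objective TA-MoE derives from its communication model is not reproduced here; the names `topology_aware_aux_loss`, `expert_node`, and `alpha` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def topology_aware_aux_loss(gate_logits, expert_node, local_node, alpha=0.01):
    """Hypothetical auxiliary term biasing the gate toward intra-node experts.

    gate_logits: [tokens, experts] raw gate scores on the current rank.
    expert_node: [experts] int tensor, node hosting each expert.
    local_node:  int, node id of the current rank.
    alpha:       weight of the auxiliary term.
    """
    probs = F.softmax(gate_logits, dim=-1)              # [tokens, experts]
    intra_mask = (expert_node == local_node).float()    # [experts]
    # Average probability mass each token keeps on its own node.
    intra_fraction = (probs * intra_mask).sum(dim=-1).mean()
    # Penalize the fraction dispatched off-node (cross-node traffic).
    return alpha * (1.0 - intra_fraction)
```

In practice such a term would be added alongside the load-balancing loss already used by GShard-style gates, so that locality is encouraged without collapsing all traffic onto local experts.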
GPT-3 Scared You? Meet Wu Dao 2.0: A Monster of 1.75 Trillion Parameters
Jack Clark, OpenAI's policy director, calls this trend of copying GPT-3 "model diffusion." Yet, among all the copies, Wu Dao 2.0 holds the record of being the largest of all, with a striking 1.75 trillion parameters (10x GPT-3). Coco Feng reported for the South China Morning Post that Wu Dao 2.0 was trained on 4.9TB of high-quality text and image data, which makes GPT-3's training dataset (570GB) pale in comparison. Still, it's worth noting that OpenAI researchers curated 45TB of data to extract those clean 570GB. Wu Dao 2.0 can learn from text and images and tackle tasks that involve both types of data (something GPT-3 can't do).
Five Key Facts About Wu Dao 2.0: The Largest Transformer Model Ever Built - KDnuggets
I recently started a new newsletter focused on AI education, and it already has over 50,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) AI-focused newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. It seems that every other month we hit a new milestone in the race to build massively large transformer models. GPT-2 set new records with a 1.5-billion-parameter model, only to be surpassed by Microsoft's Turing-NLG with 17 billion parameters.