AITopics | megatron

Collaborating Authors

megatron

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

[Figure: a prompt sent to a black-box API yields a raw runtime (= denoised runtime + noise). The prompt has num_prompt_tokens and the output has num_output_tokens. Given chosen hardware and software (e.g., A100 GPUs and Megatron), the same prompt maps to an idealized runtime.]

Neural Information Processing Systems, Apr-29-2026, 20:51:12 GMT

Large language models (LLMs) are highly capable but also computationally expensive. Characterizing the fundamental tradeoff between inference efficiency and model capabilities is thus important, but requires an efficiency metric that is comparable across models from different providers. Unfortunately, raw runtimes measured through black-box APIs do not satisfy this property: model providers can implement software and hardware optimizations orthogonal to the model, and shared infrastructure introduces performance contention. We propose a new metric for inference efficiency called idealized runtime, that puts models on equal footing as though they were served on uniform hardware and software without performance contention, and a cost model to efficiently estimate this metric for autoregressive Transformer models. We also propose variants of the idealized runtime that incorporate the number and type of accelerators needed to serve the model. Using these metrics, we compare ten LLMs developed in 2022 to provide the first analysis of inference efficiency-capability tradeoffs; we make several observations from this analysis, including the fact that the superior inference runtime performance of certain APIs is often a byproduct of optimizations within the API rather than the underlying model.
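The distinction between raw and idealized runtime in the abstract above can be illustrated with a small sketch. This is not the paper's cost model: the linear form, the coefficient names in `CostModel`, and the use of the minimum as a denoising rule are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostModel:
    """Hypothetical coefficients for one model on one fixed hardware/software stack."""
    prompt_overhead_s: float   # assumed fixed cost to start processing a prompt
    prompt_s_per_token: float  # assumed linear cost per prompt token
    output_s_per_token: float  # assumed cost of one autoregressive decoding step


def idealized_runtime(model: CostModel, num_prompt_tokens: int, num_output_tokens: int) -> float:
    """Estimate runtime as prompt processing plus one decoding step per output token."""
    prompt_time = model.prompt_overhead_s + model.prompt_s_per_token * num_prompt_tokens
    decode_time = model.output_s_per_token * num_output_tokens
    return prompt_time + decode_time


def denoised_runtime(raw_runtimes_s: list[float]) -> float:
    """Assumption: contention only ever adds time, so the fastest run is the cleanest."""
    return min(raw_runtimes_s)


# Illustrative numbers only, not measurements.
example_model = CostModel(prompt_overhead_s=0.05, prompt_s_per_token=0.0002, output_s_per_token=0.02)
estimate = idealized_runtime(example_model, num_prompt_tokens=500, num_output_tokens=100)
cleanest = denoised_runtime([2.9, 2.4, 3.1])
```

Because the estimate depends only on token counts and per-stack coefficients, two models can be compared under the same assumed stack regardless of how their providers actually serve them.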

large language model, machine learning, natural language, (16 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


References

Neural Information Processing Systems, Apr-25-2026, 06:15:28 GMT

Distributed balanced partitioning via linear embedding. Language models are few-shot learners. GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server. More effective distributed ML via a stale synchronous parallel parameter server. TransGAN: Two pure transformers can make one strong GAN, and that can scale up.

artificial intelligence, arxiv preprint arxiv, machine learning, (12 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)


AMP: Automatically Finding Model Parallel Strategies with Heterogeneity Awareness

Neural Information Processing Systems, Apr-25-2026, 06:15:25 GMT

Scaling up model sizes can lead to fundamentally new capabilities in many machine learning (ML) tasks. However, training big models requires strong distributed system expertise to carefully design model-parallel execution strategies that suit the model architectures and cluster setups. In this paper, we develop AMP, a framework that automatically derives such strategies. AMP identifies a valid space of model parallelism strategies and efficiently searches the space for high-performing strategies, by leveraging a cost model designed to capture the heterogeneity of the model and cluster specifications. Unlike existing methods, AMP is specifically tailored to support complex models composed of uneven layers and cluster setups with more heterogeneous accelerators and bandwidth. We evaluate AMP on popular models and cluster setups from public clouds and show that AMP returns parallel strategies that match the expert-tuned strategies on typical cluster setups. On heterogeneous clusters or models with heterogeneous architectures, AMP finds strategies with 1.54× and 1.77× higher throughput than state-of-the-art model-parallel systems, respectively.
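The general idea of searching a space of parallelism strategies with a cost model can be sketched as follows. This is not AMP's algorithm or cost model: `enumerate_strategies` and `toy_cost` are hypothetical helpers, and the cost terms (a penalty when tensor parallelism spans more GPUs than one node holds, a pipeline bubble fraction, a gradient synchronization term) are rough stand-ins chosen for illustration.

```python
from itertools import product


def enumerate_strategies(num_gpus: int) -> list[tuple[int, int, int]]:
    """All (data, tensor, pipeline) parallel degrees whose product equals num_gpus."""
    divisors = [d for d in range(1, num_gpus + 1) if num_gpus % d == 0]
    strategies = []
    for dp, tp in product(divisors, repeat=2):
        if num_gpus % (dp * tp) == 0:
            strategies.append((dp, tp, num_gpus // (dp * tp)))
    return strategies


def toy_cost(strategy: tuple[int, int, int], gpus_per_node: int = 4, micro_batches: int = 8) -> float:
    """Hypothetical relative step time; lower is better. All constants are made up."""
    dp, tp, pp = strategy
    # Tensor-parallel traffic is assumed far costlier once it crosses node boundaries.
    tp_comm = 0.1 * (tp - 1) * (5.0 if tp > gpus_per_node else 1.0)
    # Idle fraction of a simple pipeline schedule with the given micro-batch count.
    bubble = (pp - 1) / (micro_batches + pp - 1)
    # Gradient all-reduce cost grows with the data-parallel degree.
    dp_sync = 0.2 * (dp - 1) / dp
    return 1.0 + tp_comm + bubble + dp_sync


candidates = enumerate_strategies(8)
best = min(candidates, key=toy_cost)
```

Making `gpus_per_node` a parameter of the cost function is one simple way to let cluster topology influence the ranking, which is the kind of heterogeneity awareness the abstract describes.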

artificial intelligence, machine learning, optimization problem, (20 more...)

Neural Information Processing Systems

Genre: Research Report (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)


BigMac: A Communication-Efficient Mixture-of-Experts Model Structure for Fast Training and Inference

Jin, Zewen, Wang, Shengnan, Zhu, Jiaan, Zhan, Hongrui, Bai, Youhui, Zhang, Lin, Ming, Zhenyu, Li, Cheng

arXiv.org Artificial Intelligence, Mar-7-2025

The Mixture-of-Experts (MoE) structure scales Transformer-based large language models (LLMs) and improves their performance with only a sub-linear increase in computation resources. Recently, a fine-grained DeepSeekMoE structure was proposed, which further improves the computing efficiency of MoE without performance degradation. However, the All-to-All communication introduced by MoE has become a bottleneck, especially for the fine-grained structure, which typically involves and activates more experts and therefore incurs heavier communication overhead. In this paper, we propose a novel MoE structure named BigMac, which is also fine-grained but has high communication efficiency. BigMac's main innovation is to abandon the communicate-descend-ascend-communicate (CDAC) manner used by fine-grained MoE, in which the All-to-All communication always takes place at the highest dimension. Instead, BigMac adopts an efficient descend-communicate-communicate-ascend (DCCA) manner. Specifically, we add a descending and an ascending projection at the entrance and exit of the expert, respectively, which enables the communication to be performed at a very low dimension. Furthermore, to adapt to DCCA, we redesign the structure of small experts, ensuring that each expert in BigMac has enough complexity to process tokens. Experimental results show that BigMac achieves comparable or even better model quality than fine-grained MoEs with the same number of experts and a similar number of total parameters. Equally important, BigMac reduces the end-to-end latency by up to 3.09× for training and increases the throughput by up to 3.11× for inference on state-of-the-art AI computing frameworks including Megatron, Tutel, and DeepSpeed-Inference.
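A back-of-the-envelope sketch shows why the order of projection and communication matters. The function name, the token count, the number of routed experts, and both dimensions below are illustrative assumptions, not values from the paper; the sketch also assumes each routed token copy is sent once (dispatch) and returned once (combine) in 16-bit precision.

```python
def all_to_all_bytes(num_tokens: int, experts_per_token: int, dim: int, bytes_per_element: int = 2) -> int:
    """Bytes moved by one dispatch plus one combine All-to-All at the given dimension."""
    dispatch_and_combine = 2
    return dispatch_and_combine * num_tokens * experts_per_token * dim * bytes_per_element


# Hypothetical sizes for illustration.
num_tokens = 4096
experts_per_token = 8
hidden_dim = 4096  # CDAC-style ordering: communication happens at the full hidden dimension
low_dim = 512      # DCCA-style ordering: tokens are projected down before communication

cdac_bytes = all_to_all_bytes(num_tokens, experts_per_token, hidden_dim)
dcca_bytes = all_to_all_bytes(num_tokens, experts_per_token, low_dim)
reduction = cdac_bytes / dcca_bytes
```

Under these assumptions the communicated volume shrinks by the ratio of the two dimensions. End-to-end speedups depend on much more than volume, so this ratio should not be read as a latency prediction.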

bigmac, communication, moe model, (16 more...)

arXiv.org Artificial Intelligence

2502.16927

Country:

Asia > China > Anhui Province > Hefei (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Europe > Denmark > Capital Region > Copenhagen (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Unicron: Economizing Self-Healing LLM Training at Scale

He, Tao, Li, Xue, Wang, Zhibin, Qian, Kun, Xu, Jingbo, Yu, Wenyuan, Zhou, Jingren

arXiv.org Artificial Intelligence, Dec-29-2023

Training large-scale language models is increasingly critical in various domains, but it is hindered by frequent failures, leading to significant time and economic costs. Current failure recovery methods in cloud-based settings inadequately address the diverse and complex scenarios that arise, focusing narrowly on erasing downtime for individual tasks without considering the overall cost impact on a cluster. We introduce Unicron, a workload manager designed for efficient self-healing in large-scale language model training. Unicron optimizes the training process by minimizing failure-related costs across multiple concurrent tasks within a cluster. Its key features include in-band error detection for real-time error identification without extra overhead, a dynamic cost-aware plan generation mechanism for optimal reconfiguration, and an efficient transition strategy to reduce downtime during state changes. Deployed on a 128-GPU distributed cluster, Unicron demonstrates up to a 1.9x improvement in training efficiency over state-of-the-art methods, significantly reducing failure recovery costs and enhancing the reliability of large-scale language model training.

megatron, training process, unicron, (16 more...)

arXiv.org Artificial Intelligence

2401.00134

Country:

Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
North America > United States > New York > New York County > New York City (0.04)
Asia > China > Jiangsu Province > Nanjing (0.04)

Genre: Research Report > Promising Solution (0.48)

Industry: Information Technology > Services (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Architecture (1.00)


Clang, Clang, You're Dead! Evil Movie Robots, Ranked

#artificialintelligence, Dec-30-2022, 02:50:12 GMT

Yes, you have your R2-D2, your BB-8, Data (Brent Spiner), even WALL-E. So while we still can, take notes on these robots before they become our technological overlords. Not only are the Fem-bots evil, they are Evil's evil. Dr. Evil's (Mike Myers), to be precise. Attractive and seductive, the Fem-bots were a means of distracting, and killing, Austin Powers (Mike Myers), not only with their agility but with their "machine gun jubblies," guns protruding from their breasts.

clang, evil movie robot, robot, (14 more...)

#artificialintelligence

Country: Asia > China > Hong Kong (0.05)

Technology: Information Technology > Artificial Intelligence > Robots (0.95)


Deploying a 1.3B GPT-3 Model with NVIDIA NeMo Megatron

#artificialintelligence, Nov-6-2022, 14:25:21 GMT

Large language models (LLMs) are some of the most advanced deep learning algorithms that are capable of understanding written language. Many modern LLMs are built using the transformer network introduced by Google in 2017 in the Attention Is All You Need research paper. NVIDIA NeMo Megatron is an end-to-end GPU-accelerated framework for training and deploying transformer-based LLMs up to a trillion parameters. In September 2022, NVIDIA announced that NeMo Megatron is now available in Open Beta, allowing you to train and deploy LLMs using your own data. With this announcement, several pretrained checkpoints have been uploaded to HuggingFace, enabling anyone to deploy LLMs locally using GPUs.

megatron, nemo megatron, triton inference server, (10 more...)

#artificialintelligence

Industry: Information Technology > Hardware (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


AMP: Automatically Finding Model Parallel Strategies with Heterogeneity Awareness

Li, Dacheng, Wang, Hongyi, Xing, Eric, Zhang, Hao

arXiv.org Artificial Intelligence, Oct-13-2022

Scaling up model sizes can lead to fundamentally new capabilities in many machine learning (ML) tasks. However, training big models requires strong distributed system expertise to carefully design model-parallel execution strategies that suit the model architectures and cluster setups. In this paper, we develop AMP, a framework that automatically derives such strategies. AMP identifies a valid space of model parallelism strategies and efficiently searches the space for high-performing strategies, by leveraging a cost model designed to capture the heterogeneity of the model and cluster specifications. Unlike existing methods, AMP is specifically tailored to support complex models composed of uneven layers and cluster setups with more heterogeneous accelerators and bandwidth. We evaluate AMP on popular models and cluster setups from public clouds and show that AMP returns parallel strategies that match the expert-tuned strategies on typical cluster setups. On heterogeneous clusters or models with heterogeneous architectures, AMP finds strategies with 1.54× and 1.77× higher throughput than state-of-the-art model-parallel systems, respectively.

artificial intelligence, machine learning, optimization problem, (20 more...)

arXiv.org Artificial Intelligence

2210.07297

Country:

Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

