GPU Cluster



Reviewer

Neural Information Processing Systems

We thank the reviewers for their detailed and thoughtful feedback; we respond to each reviewer individually below. The reported value is the mean (across training seeds) of the median (across complexes) of the AUROC, which we will clarify in the Table 2 and Figure 1 captions. We will clarify this point.


PerfTracker: Online Performance Troubleshooting for Large-scale Model Training in Production

Guan, Yu, Yin, Zhiyu, Chen, Haoyu, Cheng, Sheng, Yang, Chaojie, Qian, Kun, Xu, Tianyin, Zhang, Yang, Zhao, Hanyu, Li, Yong, Lin, Wei, Cai, Dennis, Zhai, Ennan

arXiv.org Artificial Intelligence

Troubleshooting performance problems of large model training (LMT) is immensely challenging, due to the unprecedented scale of modern GPU clusters, the complexity of software-hardware interactions, and the data intensity of the training process. Existing troubleshooting approaches designed for traditional distributed systems or datacenter networks fall short and rarely apply to real-world training systems. In this paper, we present PerfTracker, the first online troubleshooting system that uses fine-grained profiling to diagnose performance issues of large-scale model training in production. PerfTracker can diagnose performance issues rooted in both hardware (e.g., GPUs and their interconnects) and software (e.g., Python functions and GPU operations), and it scales to LMT on modern GPU clusters. PerfTracker summarizes the runtime behavior patterns of fine-grained LMT functions via online profiling, and leverages differential observability to localize the root cause with minimal production impact. PerfTracker has been deployed as a production service for large-scale GPU clusters of O(10,000) GPUs (product homepage https://help.aliyun.com/zh/pai/user-guide/perftracker-online-performance-analysis-diagnostic-tool). It has been used to diagnose a variety of difficult performance issues.
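To make the idea of comparing profiles across peers ("differential observability") concrete, here is a minimal, hypothetical sketch, not PerfTracker's actual implementation: each rank's per-function call durations are summarized, and a (rank, function) pair is flagged when its summary deviates sharply from the fleet-wide median. The profile data, function names, and threshold below are all made up.

    import statistics

    # Hypothetical per-rank profiles: {rank: {function_name: [call durations in ms]}}.
    # In a real system these would come from online profiling of training workers.
    profiles = {
        0: {"all_reduce": [12.1, 11.8, 12.3], "fwd_attn": [8.0, 8.1, 7.9]},
        1: {"all_reduce": [12.0, 12.2, 11.9], "fwd_attn": [8.2, 8.0, 8.1]},
        2: {"all_reduce": [55.4, 60.2, 58.7], "fwd_attn": [8.1, 8.0, 8.2]},  # suspiciously slow
        3: {"all_reduce": [12.4, 11.7, 12.0], "fwd_attn": [7.9, 8.3, 8.0]},
    }

    def summarize(profile):
        """Collapse raw call durations into one number per function (median)."""
        return {fn: statistics.median(durations) for fn, durations in profile.items()}

    def localize(profiles, threshold=3.0):
        """Flag (rank, function) pairs whose summary deviates from the fleet median."""
        summaries = {rank: summarize(p) for rank, p in profiles.items()}
        functions = {fn for s in summaries.values() for fn in s}
        suspects = []
        for fn in functions:
            fleet_median = statistics.median(s[fn] for s in summaries.values() if fn in s)
            for rank, s in summaries.items():
                if fn in s and s[fn] > threshold * fleet_median:
                    suspects.append((rank, fn, s[fn], fleet_median))
        return suspects

    for rank, fn, value, reference in localize(profiles):
        print(f"rank {rank}: {fn} median {value:.1f} ms vs fleet {reference:.1f} ms")

In practice the summaries, thresholds, and set of profiled functions would be far richer; the point is the structure of comparing each worker against its peers rather than against an absolute baseline.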


Meta's Next Llama AI Models Are Training on a GPU Cluster 'Bigger Than Anything' Else

WIRED

Meta CEO Mark Zuckerberg laid down the newest marker in generative AI training on Wednesday, saying that the next major release of the company's Llama model is being trained on a cluster of GPUs that's "bigger than anything" else that's been reported. Llama 4 development is well underway, Zuckerberg told investors and analysts on an earnings call, with an initial launch expected early next year. "We're training the Llama 4 models on a cluster that is bigger than 100,000 H100s, or bigger than anything that I've seen reported for what others are doing," Zuckerberg said, referring to the Nvidia chips popular for training AI systems. "I expect that the smaller Llama 4 models will be ready first." Increasing the scale of AI training with more computing power and data is widely believed to be key to developing significantly more capable AI models.


Exclusive: New Research Finds Stark Global Divide in Ownership of Powerful AI Chips

TIME - Tech

When we think of the "cloud," we often imagine data floating invisibly in the ether. But the reality is far more tangible: the cloud is located in huge buildings called data centers, filled with powerful, energy-hungry computer chips. Those chips, particularly graphics processing units (GPUs), have become a critical piece of infrastructure for the world of AI, as they are required to build and run powerful chatbots like ChatGPT. As the number of things you can do with AI grows, so does the geopolitical importance of high-end chips--and where they are located in the world. The U.S. and China are competing to amass stockpiles, with Washington enacting sanctions aimed at preventing Beijing from buying the most cutting-edge varieties.


G-Meta: Distributed Meta Learning in GPU Clusters for Large-Scale Recommender Systems

Xiao, Youshao, Zhao, Shangchun, Zhou, Zhenglei, Huan, Zhaoxin, Ju, Lin, Zhang, Xiaolu, Wang, Lin, Zhou, Jun

arXiv.org Artificial Intelligence

Recently, a new paradigm, meta learning, has been widely applied to Deep Learning Recommendation Models (DLRM) and significantly improves statistical performance, especially in cold-start scenarios. However, existing systems are not tailored to meta learning based DLRM models and suffer from critical efficiency problems in distributed training on GPU clusters, because the conventional deep learning pipeline is not optimized for the two task-specific datasets and two update loops of meta learning. This paper provides G-Meta, a high-performance framework for large-scale training of optimization-based meta DLRM models over a GPU cluster. First, G-Meta combines data parallelism and model parallelism with careful orchestration of computation and communication to enable high-speed distributed training. Second, it proposes a Meta-IO pipeline for efficient data ingestion that alleviates the I/O bottleneck. Experimental results show that G-Meta achieves notable training speedups without loss of statistical performance. Since early 2022, G-Meta has been deployed in Alipay's core advertising and recommender system, shortening the continuous delivery of models by a factor of four. It also obtains a 6.48% improvement in Conversion Rate (CVR) and a 1.06% increase in CPM (Cost Per Mille) in Alipay's homepage display advertising, benefiting from larger training samples and more tasks.
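To see why meta learning stresses the training pipeline differently from conventional training, the toy sketch below walks through the two task-specific datasets (support and query) and the two update loops of a first-order, optimization-based meta learner. It is an illustrative simplification on a made-up linear model, not G-Meta's implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    dim = 4
    w_shared = np.array([1.0, -2.0, 0.5, 3.0])   # structure shared across tasks

    def loss_and_grad(w, X, y):
        """Mean squared error of a linear model y ~ X @ w, and its gradient."""
        err = X @ w - y
        return float(err @ err) / len(y), 2.0 * X.T @ err / len(y)

    def make_task(n=32):
        """One task = a support set (inner loop) and a query set (outer loop)."""
        w_task = w_shared + 0.1 * rng.normal(size=dim)   # task-specific variation
        def sample(n):
            X = rng.normal(size=(n, dim))
            return X, X @ w_task + 0.01 * rng.normal(size=n)
        return sample(n), sample(n)

    w_meta = np.zeros(dim)
    inner_lr, outer_lr = 0.1, 0.05

    for step in range(500):
        (Xs, ys), (Xq, yq) = make_task()
        # Inner loop: adapt the meta-parameters to this task using its support set.
        _, g_inner = loss_and_grad(w_meta, Xs, ys)
        w_adapted = w_meta - inner_lr * g_inner
        # Outer loop: update the meta-parameters using the query set
        # (first-order approximation: gradient evaluated at the adapted weights).
        _, g_outer = loss_and_grad(w_adapted, Xq, yq)
        w_meta -= outer_lr * g_outer

    query_loss, _ = loss_and_grad(w_meta, *make_task()[1])
    print("meta-parameters:", np.round(w_meta, 2))
    print("query loss at the meta-initialization:", round(query_loss, 3))

Each step touches two datasets and performs two distinct gradient computations, which is exactly the pattern a conventional single-loop, single-dataset training pipeline is not laid out for.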


Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment

Yang, Fei, Peng, Shuang, Sun, Ning, Wang, Fangyu, Tan, Ke, Wu, Fu, Qiu, Jiezhong, Pan, Aimin

arXiv.org Artificial Intelligence

Large language models (LLMs) such as GPT-3, OPT, and LLaMA have demonstrated remarkable accuracy in a wide range of tasks. However, training these models can incur significant expenses, often requiring tens of thousands of GPUs for months of continuous operation. Typically, this training is carried out in specialized GPU clusters equipped with homogeneous high-speed Remote Direct Memory Access (RDMA) network interface cards (NICs). The acquisition and maintenance of such dedicated clusters is challenging. Current LLM training frameworks, like Megatron-LM and Megatron-DeepSpeed, focus primarily on optimizing training within homogeneous cluster settings. In this paper, we introduce Holmes, a training framework for LLMs that employs thoughtfully crafted data and model parallelism strategies over a heterogeneous NIC environment. Our primary technical contribution lies in a novel scheduling method that intelligently allocates distinct computational tasklets in LLM training to specific groups of GPU devices based on the characteristics of their connected NICs. Furthermore, our proposed framework, utilizing pipeline parallel techniques, demonstrates scalability to multiple GPU clusters, even in scenarios without high-speed interconnects between nodes in distinct clusters. We conducted comprehensive experiments covering various scenarios in the heterogeneous NIC environment. In most cases, our framework achieves performance close to that of homogeneous RDMA-capable networks (InfiniBand or RoCE), significantly exceeding the training efficiency attainable in a pure Ethernet environment. Additionally, we verified that our framework outperforms other mainstream LLM frameworks in training efficiency under a heterogeneous NIC environment and can be seamlessly integrated with them.
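The core scheduling intuition, keeping communication-heavy groups inside GPUs that share the same kind of NIC and crossing NIC "islands" only with lighter point-to-point pipeline traffic, can be sketched as follows. This is an illustrative simplification with a made-up cluster layout and a trivial grouping rule, not Holmes's actual scheduling method.

    from collections import defaultdict

    # Hypothetical cluster description: device id -> NIC type of its host.
    devices = {
        0: "IB", 1: "IB", 2: "IB", 3: "IB",
        4: "RoCE", 5: "RoCE", 6: "RoCE", 7: "RoCE",
        8: "ETH", 9: "ETH", 10: "ETH", 11: "ETH",
    }

    def plan(devices, group_size=4):
        """Form all-reduce-heavy groups inside one NIC island; cross islands only
        with pipeline stages, whose point-to-point traffic tolerates slower links."""
        islands = defaultdict(list)
        for dev, nic in devices.items():
            islands[nic].append(dev)
        comm_heavy_groups = []
        for nic, devs in islands.items():
            for i in range(0, len(devs), group_size):
                comm_heavy_groups.append((nic, devs[i:i + group_size]))
        pipeline_stages = [group for _, group in comm_heavy_groups]
        return comm_heavy_groups, pipeline_stages

    comm_heavy_groups, pipeline_stages = plan(devices)
    print("all-reduce groups:", comm_heavy_groups)
    print("pipeline stages  :", pipeline_stages)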


Network Contention-Aware Cluster Scheduling with Reinforcement Learning

Ryu, Junyeol, Eo, Jeongyoon

arXiv.org Artificial Intelligence

With continuous advances in deep learning, distributed training is becoming common in GPU clusters. For emerging workloads with diverse amounts, ratios, and patterns of communication, we observe that network contention can significantly degrade training throughput. However, widely used scheduling policies often fall short because they are agnostic to network contention between jobs. In this paper, we present a new approach to mitigating network contention in GPU clusters using reinforcement learning. We formulate GPU cluster scheduling as a reinforcement learning problem and learn a network contention-aware scheduling policy that captures contention sensitivities and dynamically adapts scheduling decisions through continuous evaluation and improvement. Compared to widely used scheduling policies, our approach reduces average job completion time by up to 18.2% and cuts tail job completion time by up to 20.7%, while offering a favorable trade-off between average job completion time and resource utilization.
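As a rough illustration of what "formulating GPU cluster scheduling as a reinforcement learning problem" means, the toy sketch below defines a state (current per-node communication load plus the incoming job), an action (the node chosen), and a reward (the negative, contention-inflated job completion time), and compares two hand-written baseline policies in place of a learned one. The job model, contention model, and numbers are all made up; the paper's actual state, action, reward, and learned policy are far richer.

    import random

    NODES = 4

    def new_job(rng):
        """A job is (compute time in seconds, communication intensity in [0, 1])."""
        return (rng.uniform(5, 20), rng.random())

    def step(node_load, job, action):
        """Place `job` on node `action`; reward is the negative contention-inflated JCT."""
        compute, comm = job
        # Jobs sharing a node contend for its network; the slowdown grows with the
        # communication intensity already resident on that node.
        slowdown = 1.0 + comm * node_load[action]
        node_load[action] += comm
        return -compute * slowdown

    def evaluate(policy, episodes=200):
        job_rng = random.Random(0)          # same job arrivals for every policy
        total = 0.0
        for _ in range(episodes):
            node_load = [0.0] * NODES
            for _ in range(8):              # 8 jobs arrive per episode
                job = new_job(job_rng)
                state = (tuple(node_load), job)
                total += step(node_load, job, policy(state))
        return total / episodes

    random_policy = lambda state: random.randrange(NODES)
    # A contention-aware heuristic: pick the least communication-loaded node.
    greedy_policy = lambda state: min(range(NODES), key=lambda n: state[0][n])

    print("random policy, avg return:", round(evaluate(random_policy), 1))
    print("greedy policy, avg return:", round(evaluate(greedy_policy), 1))

A reinforcement learning approach replaces the hand-written policies with one trained to maximize the return, so it can pick up contention sensitivities that a fixed heuristic misses.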


Punica: Multi-Tenant LoRA Serving

Chen, Lequn, Ye, Zihao, Wu, Yongji, Zhuo, Danyang, Ceze, Luis, Krishnamurthy, Arvind

arXiv.org Artificial Intelligence

Low-rank adaptation (LoRA) has become an important and popular method to adapt pre-trained models to specific domains. We present Punica, a system to serve multiple LoRA models in a shared GPU cluster. Punica contains a new CUDA kernel design that allows batching of GPU operations for different LoRA models. This allows a GPU to hold only a single copy of the underlying pre-trained model when serving multiple, different LoRA models, significantly enhancing GPU efficiency in terms of both memory and computation. Our scheduler consolidates multi-tenant LoRA serving workloads in a shared GPU cluster. With a fixed-sized GPU cluster, our evaluations show that Punica achieves 12x higher throughput in serving multiple LoRA models compared to state-of-the-art LLM serving systems while only adding 2ms latency per token. LoRA, which retains the weights of the pretrained model and introduces trainable rank decomposition matrices, is increasingly popular for specializing pre-trained large models. We thus need to enable batching for different LoRA models; since serving is dominated by the decode stage, we only need to focus on decode-stage performance, and we can apply straightforward techniques, e.g., on-demand loading of LoRA model weights.
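The batching idea can be illustrated with a small NumPy sketch. This is conceptual only, with made-up shapes and adapter names, and it is not Punica's CUDA kernel: the dense matmul against the single shared pretrained weight is computed once for the whole batch, while each request gathers only its own small low-rank (A, B) pair.

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_out, rank = 16, 16, 4

    # One shared copy of the pretrained weight ...
    W = rng.normal(size=(d_in, d_out))

    # ... and a small (A, B) pair per LoRA adapter (hypothetical adapters).
    adapters = {name: (rng.normal(size=(d_in, rank)) * 0.01,
                       rng.normal(size=(rank, d_out)) * 0.01)
                for name in ["shop", "news", "chat"]}

    # A batch of decode-step inputs, each routed to a (possibly different) adapter.
    batch_x = rng.normal(size=(5, d_in))
    batch_adapter = ["shop", "chat", "shop", "news", "chat"]

    def batched_lora_forward(x, adapter_ids):
        """y_i = x_i W + x_i A_{a(i)} B_{a(i)}: the base matmul is shared by the
        whole batch; only the small low-rank updates are gathered per request."""
        base = x @ W                                         # one dense GEMM for everyone
        A = np.stack([adapters[a][0] for a in adapter_ids])  # (batch, d_in, rank)
        B = np.stack([adapters[a][1] for a in adapter_ids])  # (batch, rank, d_out)
        delta = np.einsum("bi,bir,bro->bo", x, A, B)         # per-request low-rank update
        return base + delta

    print(batched_lora_forward(batch_x, batch_adapter).shape)  # (5, 16)

Because the low-rank terms are tiny compared with the shared weight, requests for different adapters can ride the same base matmul, which is what makes a single GPU copy of the pretrained model sufficient.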


The Chan Zuckerberg Initiative is building a massive GPU cluster to 'cure, prevent or manage all diseases'

Engadget

The Chan Zuckerberg Initiative (CZI), the philanthropic organization created in 2015 by Priscilla Chan and her husband Mark Zuckerberg, announced a bold new generative AI initiative today. The group is funding and building a high-end GPU cluster that will use AI to create predictive models of healthy and diseased cells; it hopes they'll help researchers better understand the human body's cells and cellular reactions. The group believes the collection of computers will help it achieve its incredibly lofty goal of helping to "cure, prevent, or manage all diseases by the end of this century." "Researchers are gathering more data than ever before about the trillions of cells within our bodies, and it's too complex for our brains to grapple with," Jeff MacGregor, CZI vice president of communications, wrote in an emailed statement to Engadget. He cites the example of imaging a single cell at nanometer resolution, which would produce as much data as 83,000 photos on a smartphone.