AITopics | Wu, Mengdi

Collaborating Authors

Wu, Mengdi

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism

Jeon, Byungsoo, Wu, Mengdi, Cao, Shiyi, Kim, Sunghyun, Park, Sunghyun, Aggarwal, Neeraj, Unger, Colin, Arfeen, Daiyaan, Liao, Peiyuan, Miao, Xupeng, Alizadeh, Mohammad, Ganger, Gregory R., Chen, Tianqi, Jia, Zhihao

arXiv.org Artificial IntelligenceJun-24-2024

Deep neural networks (DNNs) continue to grow rapidly in size, making them infeasible to train on a single device. Pipeline parallelism is commonly used in existing DNN systems to support large-scale DNN training by partitioning a DNN into multiple stages, which concurrently perform DNN training for different micro-batches in a pipeline fashion. However, existing pipeline-parallel approaches only consider sequential pipeline stages and thus ignore the topology of a DNN, resulting in missed model-parallel opportunities. This paper presents graph pipeline parallelism (GPP), a new pipeline-parallel scheme that partitions a DNN into pipeline stages whose dependencies are identified by a directed acyclic graph. GPP generalizes existing sequential pipeline parallelism and preserves the inherent topology of a DNN to enable concurrent execution of computationally-independent operators, resulting in reduced memory requirement and improved GPU performance. In addition, we develop GraphPipe, a distributed system that exploits GPP strategies to enable performant and scalable DNN training. GraphPipe partitions a DNN into a graph of stages, optimizes micro-batch schedules for these stages, and parallelizes DNN training using the discovered GPP strategies. Evaluation on a variety of DNNs shows that GraphPipe outperforms existing pipeline-parallel systems such as PipeDream and Piper by up to 1.6X. GraphPipe also reduces the search time by 9-21X compared to PipeDream and Piper.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2406.17145

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.15)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
North America > United States > California > Santa Clara County > Palo Alto (0.14)

Genre: Research Report (0.50)

Industry: Energy > Oil & Gas > Midstream (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

A Multi-Level Superoptimizer for Tensor Programs

Wu, Mengdi, Cheng, Xinhao, Padon, Oded, Jia, Zhihao

arXiv.org Artificial IntelligenceMay-9-2024

For We introduce Mirage, the first multi-level superoptimizer for a given algorithm, these optimizers automatically generate tensor programs. A key idea in Mirage is Graphs, a uniform performant kernels by searching over possible strategies for representation of tensor programs at the kernel, thread block, executing the kernel on the target hardware. However, due and thread levels of the GPU compute hierarchy. Graphs to the linear algebra nature of DNNs, a tensor program can enable Mirage to discover novel optimizations that combine be represented by a wide spectrum of mathematically equivalent algebraic transformations, schedule transformations, and algorithms, and existing schedule-based optimizers only generation of new custom kernels. To navigate the large consider kernels whose algorithms are manually specified search space, Mirage introduces a pruning technique based by users, resulting in missed optimization opportunities.

artificial intelligence, graph, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2405.05751

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
North America > United States > California > Santa Clara County > Palo Alto (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning

Miao, Xupeng, Oliaro, Gabriele, Cheng, Xinhao, Wu, Mengdi, Unger, Colin, Jia, Zhihao

arXiv.org Artificial IntelligenceFeb-28-2024

Parameter-efficient finetuning (PEFT) is a widely used technique to adapt large language models for different tasks. Service providers typically create separate systems for users to perform PEFT model finetuning and inference tasks. This is because existing systems cannot handle workloads that include a mix of inference and PEFT finetuning requests. As a result, shared GPU resources are underutilized, leading to inefficiencies. To address this problem, we present FlexLLM, the first system that can serve inference and parameter-efficient finetuning requests in the same iteration. Our system leverages the complementary nature of these two tasks and utilizes shared GPU resources to run them jointly, using a method called co-serving. To achieve this, FlexLLM introduces a novel token-level finetuning mechanism, which breaks down the finetuning computation of a sequence into smaller token-level computations and uses dependent parallelization and graph pruning, two static compilation optimizations, to minimize the memory overhead and latency for co-serving. Compared to existing systems, FlexLLM's co-serving approach reduces the activation GPU memory overhead by up to 8x, and the end-to-end GPU memory requirement of finetuning by up to 36% while maintaining a low inference latency and improving finetuning throughput. For example, under a heavy inference workload, FlexLLM can still preserve more than 80% of the peak finetuning throughput, whereas existing systems cannot make any progress with finetuning. The source code of FlexLLM is publicly available at https://github.com/flexflow/FlexFlow.

large language model, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2402.18789

Country: North America > United States (0.46)

Genre: Research Report (0.50)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback