
Collaborating Authors

 Cheng, Xinhao


A Multi-Level Superoptimizer for Tensor Programs

arXiv.org Artificial Intelligence

We introduce Mirage, the first multi-level superoptimizer for tensor programs. A key idea in Mirage is μGraphs, a uniform representation of tensor programs at the kernel, thread block, and thread levels of the GPU compute hierarchy. μGraphs enable Mirage to discover novel optimizations that combine algebraic transformations, schedule transformations, and generation of new custom kernels. To navigate the large search space, Mirage introduces a pruning technique based on abstraction. For a given algorithm, schedule-based optimizers automatically generate performant kernels by searching over possible strategies for executing the kernel on the target hardware. However, due to the linear algebra nature of DNNs, a tensor program can be represented by a wide spectrum of mathematically equivalent algorithms, and existing schedule-based optimizers only consider kernels whose algorithms are manually specified by users, resulting in missed optimization opportunities.
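
The idea can be sketched with a small, hypothetical Python/NumPy example (illustrative only; it is not Mirage's μGraph representation or API). Two mathematically equivalent formulations of attention, a naive one and a blocked online-softmax one, act as candidate algorithms, and a random-input comparison stands in for the superoptimizer's equivalence check over a much larger space of kernel-, thread-block-, and thread-level programs.

import numpy as np

def naive_attention(q, k, v):
    # Reference algorithm: materialize the full score matrix, then softmax.
    s = q @ k.T
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

def blocked_attention(q, k, v, block=2):
    # Mathematically equivalent algorithm with a different schedule:
    # online softmax over key/value blocks, never materializing all scores.
    m = np.full(q.shape[0], -np.inf)          # running row-wise max
    l = np.zeros(q.shape[0])                  # running softmax normalizer
    acc = np.zeros((q.shape[0], v.shape[1]))  # running weighted value sum
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T
        m_new = np.maximum(m, s.max(axis=1))
        scale = np.exp(m - m_new)
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=1)
        acc = acc * scale[:, None] + p @ vb
        m = m_new
    return acc / l[:, None]

def equivalent(f, g, trials=5, tol=1e-6):
    # Stand-in for a real equivalence check: compare candidates on random inputs.
    rng = np.random.default_rng(0)
    for _ in range(trials):
        q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
        if not np.allclose(f(q, k, v), g(q, k, v), atol=tol):
            return False
    return True

# Keep only candidates that agree with the reference; a superoptimizer would
# then pick the cheapest equivalent program for the target hardware.
candidates = {"naive": naive_attention, "blocked": blocked_attention}
valid = {name for name, fn in candidates.items() if equivalent(naive_attention, fn)}
print("equivalent candidates:", sorted(valid))

A real superoptimizer would generate the candidate programs automatically rather than by hand, and would use a much more rigorous equivalence check than the random testing shown here.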


FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning

arXiv.org Artificial Intelligence

Parameter-efficient finetuning (PEFT) is a widely used technique to adapt large language models for different tasks. Service providers typically create separate systems for users to perform PEFT model finetuning and inference tasks. This is because existing systems cannot handle workloads that include a mix of inference and PEFT finetuning requests. As a result, shared GPU resources are underutilized, leading to inefficiencies. To address this problem, we present FlexLLM, the first system that can serve inference and parameter-efficient finetuning requests in the same iteration. Our system leverages the complementary nature of these two tasks and utilizes shared GPU resources to run them jointly, using a method called co-serving. To achieve this, FlexLLM introduces a novel token-level finetuning mechanism, which breaks down the finetuning computation of a sequence into smaller token-level computations and uses dependent parallelization and graph pruning, two static compilation optimizations, to minimize the memory overhead and latency for co-serving. Compared to existing systems, FlexLLM's co-serving approach reduces the activation GPU memory overhead by up to 8x, and the end-to-end GPU memory requirement of finetuning by up to 36% while maintaining a low inference latency and improving finetuning throughput. For example, under a heavy inference workload, FlexLLM can still preserve more than 80% of the peak finetuning throughput, whereas existing systems cannot make any progress with finetuning. The source code of FlexLLM is publicly available at https://github.com/flexflow/FlexFlow.
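
The co-serving idea can be illustrated with a small, hypothetical Python sketch (the scheduler loop, token budget, and data structures below are illustrative and are not FlexLLM's actual interfaces): each iteration first gives every inference request one decode slot, then backfills any remaining slots in the token budget with token-level pieces of finetuning computation, so finetuning makes progress whenever inference leaves capacity idle.

from collections import deque
from dataclasses import dataclass

@dataclass
class InferenceRequest:
    rid: int
    remaining_decode_steps: int   # one new token is generated per iteration

def co_serving_schedule(inference, finetune_tokens, token_budget=6):
    # Illustrative co-serving loop: inference decode tokens are placed first,
    # then leftover slots are backfilled with token-level finetuning work.
    finetune_queue = deque(finetune_tokens)
    iteration = 0
    while inference or finetune_queue:
        iteration += 1
        batch = []
        for req in list(inference):               # 1) latency-sensitive inference
            if len(batch) < token_budget:
                batch.append(("infer", req.rid))
                req.remaining_decode_steps -= 1
                if req.remaining_decode_steps == 0:
                    inference.remove(req)
        while finetune_queue and len(batch) < token_budget:
            batch.append(("finetune", finetune_queue.popleft()))  # 2) backfill
        print(f"iteration {iteration}: {batch}")

requests = [InferenceRequest(rid=0, remaining_decode_steps=3),
            InferenceRequest(rid=1, remaining_decode_steps=2)]
co_serving_schedule(requests, finetune_tokens=list(range(20)))

Under a heavy inference load the backfill shrinks, which matches the abstract's observation that inference latency is preserved while finetuning continues to make progress instead of stalling.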


Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems

arXiv.org Artificial Intelligence

In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However, the computational intensity and memory consumption of deploying these models present substantial challenges in terms of serving efficiency, particularly in scenarios demanding low latency and high throughput. This survey addresses the imperative need for efficient LLM serving methodologies from a machine learning system (MLSys) research perspective, standing at the crux of advanced AI innovations and practical system optimizations. We provide in-depth analysis, covering a spectrum of solutions, ranging from cutting-edge algorithmic modifications to groundbreaking changes in system designs. The survey aims to provide a comprehensive understanding of the current state and future directions in efficient LLM serving, offering valuable insights for researchers and practitioners in overcoming the barriers of effective LLM deployment, thereby reshaping the future of AI.


SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification

arXiv.org Artificial Intelligence

The high computational and memory requirements of generative large language models (LLMs) make it challenging to serve them quickly and cheaply. This paper introduces SpecInfer, an LLM serving system that accelerates generative LLM inference with speculative inference and token tree verification. A key insight behind SpecInfer is to combine various collectively boost-tuned small language models to jointly predict the LLM's outputs; the predictions are organized as a token tree, whose nodes each represent a candidate token sequence. The correctness of all candidate token sequences represented by a token tree is verified against the LLM in parallel using a novel tree-based parallel decoding mechanism. SpecInfer uses an LLM as a token tree verifier instead of an incremental decoder, which significantly reduces the end-to-end latency and computational requirement of serving generative LLMs. Existing LLM systems generally use an incremental decoding approach to serving a request, where the system computes the activations for all prompt tokens in a single step and then iteratively decodes one new token using the input prompt and all previously generated tokens. This approach is also called autoregressive decoding because each generated token is also used as input for generating future tokens, a dependency that is crucial for many NLP tasks requiring the order and context of the generated tokens to be preserved, such as text completion [53]. Incremental decoding respects data dependencies between tokens but achieves suboptimal runtime performance and limited GPU utilization, since the degree of parallelism within each request is greatly limited in the incremental phase. In addition, the attention mechanism of the Transformer [46] requires accessing the keys and values of all previous tokens to compute the attention output of a new token.
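
Token tree verification can be illustrated with a small, hypothetical Python sketch (the TreeNode structure and the llm_next_token stand-in below are illustrative, not SpecInfer's implementation). Draft models speculate a tree of candidate continuations, and the verifier accepts the longest root-to-leaf chain whose tokens match what the LLM itself would emit; SpecInfer scores all tree nodes in a single tree-parallel decoding pass, whereas this sketch walks the tree recursively for clarity.

from dataclasses import dataclass, field

@dataclass
class TreeNode:
    token: str
    children: list = field(default_factory=list)

def llm_next_token(prefix):
    # Stand-in for an LLM forward pass: deterministically maps the last token
    # of the prefix to the next token (greedy decoding on a toy "model").
    vocab = {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}
    return vocab.get(prefix[-1], "<eos>")

def verify(prefix, node):
    # Returns the longest verified token sequence rooted at `node`.
    if node.token != llm_next_token(prefix):
        return []
    best = []
    for child in node.children:
        candidate = verify(prefix + [node.token], child)
        if len(candidate) > len(best):
            best = candidate
    return [node.token] + best

# Token tree speculated by small draft models for the prompt "the cat".
root_candidates = [
    TreeNode("sat", [TreeNode("on", [TreeNode("the")]), TreeNode("down")]),
    TreeNode("ran"),
]
prompt = ["the", "cat"]
accepted = max((verify(prompt, node) for node in root_candidates), key=len)
print("accepted tokens:", accepted)   # -> ['sat', 'on', 'the']

Because every token along the accepted path is committed in a single verification step, the number of sequential LLM decoding steps drops, which is where the latency savings of speculative inference come from.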