AITopics | Cheung, Alvin

Collaborating Authors

Cheung, Alvin

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

M\'elange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity

Griggs, Tyler, Liu, Xiaoxuan, Yu, Jiaxiang, Kim, Doyoung, Chiang, Wei-Lin, Cheung, Alvin, Stoica, Ion

arXiv.org Artificial IntelligenceJun-27-2024

Large language models (LLMs) are increasingly integrated into many online services, yet they remain cost-prohibitive to deploy due to the requirement of expensive GPU instances. Prior work has addressed the high cost of LLM serving by improving the inference engine, but less attention has been given to selecting the most cost-efficient GPU type(s) for a specific LLM service. There is a large and growing landscape of GPU types and, within these options, higher cost does not always lead to increased performance. Instead, through a comprehensive investigation, we find that three key LLM service characteristics (request size, request rate, SLO) strongly influence GPU cost efficiency, and differing GPU types are most cost efficient for differing LLM service settings. As a result, the most cost-efficient allocation for a given service is typically a mix of heterogeneous GPU types. Based on this analysis, we introduce M\'elange, a GPU allocation framework that navigates these diverse LLM service characteristics and heterogeneous GPU option space to automatically and efficiently derive the minimal-cost GPU allocation for a given LLM service. We formulate the GPU allocation task as a cost-aware bin packing problem where GPUs are bins and items are slices of the service workload. Our formulation's constraints account for a service's unique characteristics, allowing M\'elange to be flexible to support diverse service settings and heterogeneity-aware to adapt the GPU allocation to a specific service. Compared to using only a single GPU type, M\'elange reduces deployment costs by up to 77% in conversational settings, 33% in document-based settings, and 51% in a mixed setting.

large language model, machine learning, request size, (18 more...)

arXiv.org Artificial Intelligence

2404.14527

Genre: Research Report (0.64)

Industry: Information Technology (0.46)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Graphics (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks

Gong, Linyuan, Wang, Sida, Elhoushi, Mostafa, Cheung, Alvin

arXiv.org Artificial IntelligenceJun-22-2024

SAFIM emphasizes Language Models (LLMs) on the code Fill-inthe-Middle syntax-aware completion within code's Abstract Syntax (FIM) task. This benchmark focuses Tree (AST), targeting algorithmic blocks, control-flow expressions, on syntax-aware completions of program structures and API function calls, unlike existing Fillin-the such as code blocks and conditional expressions, Middle (FIM) benchmarks such as HumanEval-and includes 17,720 examples from multiple Infilling (Bavarian et al., 2022), which are based on filling programming languages, sourced from recent randomly masked lines or character spans. SAFIM is code submissions after April 2022 to minimize sourced from code on Codeforces and GitHub created after data contamination. SAFIM provides a April 2022, deliberately aiming to avoid overlap with mainstream robust framework with various prompt designs open-source pretraining corpora like The Stack (Kocetkov and novel syntax-aware post-processing techniques, et al., 2022). This approach reduces the risks of facilitating accurate and fair comparisons data contamination caused by memoization of test cases, across LLMs.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2403.04814

Country: Europe > Austria > Vienna (0.14)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

AST-T5: Structure-Aware Pretraining for Code Generation and Understanding

Gong, Linyuan, Elhoushi, Mostafa, Cheung, Alvin

arXiv.org Artificial IntelligenceFeb-9-2024

Large language models (LLMs) have made significant advancements in code-related tasks, yet many LLMs treat code as simple sequences, neglecting its structured nature. We introduce AST-T5, a novel pretraining paradigm that leverages the Abstract Syntax Tree (AST) for enhanced code generation, transpilation, and understanding. Using dynamic programming, our AST-Aware Segmentation retains code structure, while our AST-Aware Span Corruption objective equips the model to reconstruct various code structures. Unlike other models, AST-T5 avoids intricate program analyses or architectural changes, so it integrates seamlessly with any encoder-decoder Transformer. Evaluations show that AST-T5 consistently outperforms similar-sized LMs across various code-related tasks. Structure-awareness makes AST-T5 particularly powerful in code-to-code tasks, surpassing CodeT5 by 2 points in exact match score for the Bugs2Fix task and by 3 points in exact match score for Java-C# Transpilation in CodeXGLUE. Our code and model are publicly available at https://github.com/gonglinyuan/ast_t5.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2401.03003

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Online Speculative Decoding

Liu, Xiaoxuan, Hu, Lanxiang, Bailis, Peter, Stoica, Ion, Deng, Zhijie, Cheung, Alvin, Zhang, Hao

arXiv.org Artificial IntelligenceOct-17-2023

Speculative decoding is a pivotal technique to accelerate the inference of large language models (LLMs) by employing a smaller draft model to predict the target model's outputs. However, its efficacy can be limited due to the low predictive accuracy of the draft model, particularly when faced with diverse text inputs and a significant capability gap between the draft and target models. We introduce online speculative decoding (OSD) to address this challenge. The main idea is to continually update (multiple) draft model(s) on observed user query data using the abundant excess computational power in an LLM serving cluster. Given that LLM inference is memory-bounded, the surplus computational power in a typical LLM serving cluster can be repurposed for online retraining of draft models, thereby making the training cost-neutral. Since the query distribution of an LLM service is relatively simple, retraining on query distribution enables the draft model to more accurately predict the target model's outputs, particularly on data originating from query distributions. As the draft model evolves online, it aligns with the query distribution in real time, mitigating distribution shifts. We develop a prototype of online speculative decoding based on online knowledge distillation and evaluate it using both synthetic and real query data on several popular LLMs. The results show a substantial increase in the token acceptance rate by 0.1 to 0.65, which translates into 1.22x to 3.06x latency reduction.

large language model, natural language, online speculative decoding, (1 more...)

arXiv.org Artificial Intelligence

2310.07177

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

An Evaluation of Memory Optimization Methods for Training Neural Networks

Liu, Xiaoxuan, Jha, Siddharth, Cheung, Alvin

arXiv.org Artificial IntelligenceJun-4-2023

As models continue to grow in size, the development of memory optimization methods (MOMs) has emerged as a solution to address the memory bottleneck encountered when training large models. To comprehensively examine the practical value of various MOMs, we have conducted a thorough analysis of existing literature from a systems perspective. Our analysis has revealed a notable challenge within the research community: the absence of standardized metrics for effectively evaluating the efficacy of MOMs. The scarcity of informative evaluation metrics hinders the ability of researchers and practitioners to compare and benchmark different approaches reliably. Consequently, drawing definitive conclusions and making informed decisions regarding the selection and application of MOMs becomes a challenging endeavor. To address the challenge, this paper summarizes the scenarios in which MOMs prove advantageous for model training. We propose the use of distinct evaluation metrics under different scenarios. By employing these metrics, we evaluate the prevailing MOMs and find that their benefits are not universal. We present insights derived from experiments and discuss the circumstances in which they can be advantageous.

artificial intelligence, machine learning, throughput, (17 more...)

arXiv.org Artificial Intelligence

2303.14633

Genre: Research Report > New Finding (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Add feedback

SlimFit: Memory-Efficient Fine-Tuning of Transformer-based Models Using Training Dynamics

Ardakani, Arash, Haan, Altan, Tan, Shangyin, Popovici, Doru Thom, Cheung, Alvin, Iancu, Costin, Sen, Koushik

arXiv.org Artificial IntelligenceMay-29-2023

Transformer-based models, such as BERT and ViT, have achieved state-of-the-art results across different natural language processing (NLP) and computer vision (CV) tasks. However, these models are extremely memory intensive during their fine-tuning process, making them difficult to deploy on GPUs with limited memory resources. To address this issue, we introduce a new tool called SlimFit that reduces the memory requirements of these models by dynamically analyzing their training dynamics and freezing less-contributory layers during fine-tuning. The layers to freeze are chosen using a runtime inter-layer scheduling algorithm. SlimFit adopts quantization and pruning for particular layers to balance the load of dynamic activations and to minimize the memory footprint of static activations, where static activations refer to those that cannot be discarded regardless of freezing. This allows SlimFit to freeze up to 95% of layers and reduce the overall on-device GPU memory usage of transformer-based models such as ViT and BERT by an average of 2.2x, across different NLP and CV benchmarks/datasets such as GLUE, SQuAD 2.0, CIFAR-10, CIFAR-100 and ImageNet with an average degradation of 0.2% in accuracy. For such NLP and CV tasks, SlimFit can reduce up to 3.1x the total on-device memory usage with an accuracy degradation of only up to 0.4%. As a result, while fine-tuning of ViT on ImageNet and BERT on SQuAD 2.0 with a batch size of 128 requires 3 and 2 32GB GPUs respectively, SlimFit enables their fine-tuning on a single 32GB GPU without any significant accuracy degradation.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2305.18513

Country:

Europe (0.28)
North America > United States > California (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Model-Generated Pretraining Signals Improves Zero-Shot Generalization of Text-to-Text Transformers

Gong, Linyuan, Xiong, Chenyan, Liu, Xiaodong, Bajaj, Payal, Xie, Yiqing, Cheung, Alvin, Gao, Jianfeng, Song, Xia

arXiv.org Artificial IntelligenceMay-21-2023

This paper explores the effectiveness of model-generated signals in improving zero-shot generalization of text-to-text Transformers such as T5. We study various designs to pretrain T5 using an auxiliary model to construct more challenging token replacements for the main model to denoise. Key aspects under study include the decoding target, the location of the RTD head, and the masking pattern. Based on these studies, we develop a new model, METRO-T0, which is pretrained using the redesigned ELECTRA-Style pretraining strategies and then prompt-finetuned on a mixture of NLP tasks. METRO-T0 outperforms all similar-sized baselines on prompted NLP benchmarks, such as T0 Eval and MMLU, and rivals the state-of-the-art T0-11B model with only 8% of its parameters. Our analysis on model's neural activation and parameter sensitivity reveals that the effectiveness of METRO-T0 stems from more balanced contribution of parameters and better utilization of their capacity. The code and model checkpoints are available at https://github.com/gonglinyuan/metro_t0.

artificial intelligence, large language model, natural language, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.18653/v1/2023.acl-long.724

2305.12567

Country: North America > United States (0.14)

Genre: Research Report > New Finding (0.68)

Industry: Education (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

ADELT: Transpilation Between Deep Learning Frameworks

Gong, Linyuan, Wang, Jiayi, Cheung, Alvin

arXiv.org Artificial IntelligenceMar-6-2023

We propose Adversarial DEep Learning Transpiler (ADELT) for source-to-source transpilation between deep learning frameworks. Unlike prior approaches, we decouple the transpilation of code skeletons and the mapping of API keywords (an API function name or a parameter name). ADELT transpile code skeletons using few-shot prompting on big language models. Based on contextual embeddings extracted by a BERT for code, we train aligned API embeddings in a domain-adversarial setup, upon which we generate a dictionary for keyword translation. The model is trained on our unlabeled DL corpus from web crawl data, without using any hand-crafted rules and parallel data. Our method outperforms state-of-the-art transpilers on multiple transpilation pairs including PyTorch-Keras and PyTorch-MXNet by 15.9pts and 12.0pts in exact match scores respectively.

api keyword, artificial intelligence, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2303.03593

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

GACT: Activation Compressed Training for Generic Network Architectures

Liu, Xiaoxuan, Zheng, Lianmin, Wang, Dequan, Cen, Yukuo, Chen, Weize, Han, Xu, Chen, Jianfei, Liu, Zhiyuan, Tang, Jie, Gonzalez, Joey, Mahoney, Michael, Cheung, Alvin

arXiv.org Artificial IntelligenceSep-3-2022

Training large neural network (NN) models requires extensive memory resources, and Activation Compressed Training (ACT) is a promising approach to reduce training memory footprint. This paper presents GACT, an ACT framework to support a broad range of machine learning tasks for generic NN architectures with limited domain knowledge. By analyzing a linearized version of ACT's approximate gradient, we prove the convergence of GACT without prior knowledge on operator type or model architecture. To make training stable, we propose an algorithm that decides the compression ratio for each tensor by estimating its impact on the gradient at run time. We implement GACT as a PyTorch library that readily applies to any NN architecture. GACT reduces the activation memory for convolutional NNs, transformers, and graph NNs by up to 8.1x, enabling training with a 4.2x to 24.7x larger batch size, with negligible accuracy loss. We implement GACT as a PyTorch library at https://github.com/LiuXiaoxuanPKU/GACT-ICML.

artificial intelligence, deep learning, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2206.11357

Country: North America > United States > California (0.14)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)

Add feedback

NumS: Scalable Array Programming for the Cloud

Elibol, Melih, Benara, Vinamra, Yagati, Samyu, Zheng, Lianmin, Cheung, Alvin, Jordan, Michael I., Stoica, Ion

arXiv.org Artificial IntelligenceJul-12-2022

Scientists increasingly rely on Python tools to perform scalable distributed memory array operations using rich, NumPy-like expressions. However, many of these tools rely on dynamic schedulers optimized for abstract task graphs, which often encounter memory and network bandwidth-related bottlenecks due to sub-optimal data and operator placement decisions. Tools built on the message passing interface (MPI), such as ScaLAPACK and SLATE, have better scaling properties, but these solutions require specialized knowledge to use. In this work, we present NumS, an array programming library which optimizes NumPy-like expressions on task-based distributed systems. This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS). LSHS is a local search method which optimizes operator placement by minimizing maximum memory and network load on any given node within a distributed system. Coupled with a heuristic for load balanced data layouts, our approach is capable of attaining communication lower bounds on some common numerical operations, and our empirical study shows that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem. On terabyte-scale data, NumS achieves competitive performance to SLATE on DGEMM, up to 20x speedup over Dask on a key operation for tensor factorization, and a 2x speedup on logistic regression compared to Dask ML and Spark's MLlib.

artificial intelligence, machine learning, opération, (18 more...)

arXiv.org Artificial Intelligence

2206.14276

Country: North America > United States (0.93)

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.69)

Add feedback