AITopics | Liu, Shih-Yang

Collaborating Authors

Liu, Shih-Yang

EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation

Liu, Shih-Yang, Yang, Huck, Wang, Chien-Yi, Fung, Nai Chit, Yin, Hongxu, Sakr, Charbel, Muralidharan, Saurav, Cheng, Kwang-Ting, Kautz, Jan, Wang, Yu-Chiang Frank, Molchanov, Pavlo, Chen, Min-Hung

arXiv.org Artificial Intelligence, Nov-21-2024

Although Large Language Models (LLMs) exhibit superior performance across diverse applications, their practical deployment remains challenging because of their considerable model size and high inference costs. To mitigate these challenges, model compression research such as post-training compression (Ashkboos et al., 2024; Ma et al., 2023) and compression-aware training (Alvarez & Salzmann, 2017; Lym et al., 2019; Liu et al., 2024, 2023c) has been extensively explored to reduce the computational resource demands of serving LLMs (Zhu et al., 2023). However, most existing methods either incur significant accuracy degradation compared to uncompressed models or require long training times. Additionally, their flexibility is often limited by a discrete set of compression formats (e.g., 2:4 sparsity, 3/4-bit quantization), making it challenging to meet the diverse capacity and efficiency requirements of different users. To overcome this flexibility limitation, we reformulate the model compression problem as a customized compensation problem: given a compressed model, we aim to introduce residual low-rank paths that compensate for compression errors under customized user requirements, such as tasks and compression ratios. Rather than focusing solely on producing compressed models with minimal performance degradation, the compensated model incorporates these residual paths and thereby gains greater flexibility in adjusting overall capacity, without being constrained by specific compression formats.
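The idea of a residual low-rank path can be illustrated with a minimal sketch. It assumes a plain truncated SVD of the compression error, which is a generic stand-in and not the eigenspace projection that gives EoRA its name; `toy_quantize` and `low_rank_compensation` are hypothetical helpers, and the matrix sizes and rank are arbitrary.

```python
import numpy as np

def toy_quantize(w, bits=3):
    """Hypothetical uniform symmetric quantizer standing in for any compressor."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    return np.round(w / scale) * scale

def low_rank_compensation(w, w_compressed, rank):
    """Return factors (b, a) so that w_compressed + b @ a approximates w.

    Uses a truncated SVD of the compression error (an illustrative choice).
    """
    error = w - w_compressed
    u, s, vt = np.linalg.svd(error, full_matrices=False)
    b = u[:, :rank] * s[:rank]   # shape (out_features, rank)
    a = vt[:rank, :]             # shape (rank, in_features)
    return b, a

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 48))        # original weight
w_hat = toy_quantize(w)                  # compressed weight
b, a = low_rank_compensation(w, w_hat, rank=8)

err_before = np.linalg.norm(w - w_hat)
err_after = np.linalg.norm(w - (w_hat + b @ a))
print(f"error before: {err_before:.4f}, after rank-8 path: {err_after:.4f}")
```

Because the rank is a free parameter, the same compressed weight can be paired with a larger or smaller residual path depending on the capacity a user can afford.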

eora, large language model, natural language, (14 more...)

arXiv.org Artificial Intelligence

2410.21271

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)


Hymba: A Hybrid-head Architecture for Small Language Models

Dong, Xin, Fu, Yonggan, Diao, Shizhe, Byeon, Wonmin, Chen, Zijia, Mahabaleshwarkar, Ameya Sunil, Liu, Shih-Yang, Van Keirsbilck, Matthijs, Chen, Min-Hung, Suhara, Yoshi, Lin, Yingyan, Kautz, Jan, Molchanov, Pavlo

arXiv.org Artificial Intelligence, Nov-20-2024

We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficiency. Attention heads provide high-resolution recall, while SSM heads enable efficient context summarization. Additionally, we introduce learnable meta tokens that are prepended to prompts, storing critical information and alleviating the "forced-to-attend" burden associated with attention mechanisms. This model is further optimized by incorporating cross-layer key-value (KV) sharing and partial sliding window attention, resulting in a compact cache size. During development, we conducted a controlled study comparing various architectures under identical settings and observed significant advantages of our proposed architecture. Notably, Hymba achieves state-of-the-art results for small LMs: Our Hymba-1.5B-Base model surpasses all sub-2B public models in performance and even outperforms Llama-3.2-3B with 1.32% higher average accuracy, an 11.67x cache size reduction, and 3.49x throughput.
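To see why cross-layer KV sharing and sliding window attention shrink the cache, here is a back-of-the-envelope sketch. Every number in it (layer count, number of full-attention layers, window size, head shape, pairwise sharing) is a hypothetical choice for illustration and is not taken from Hymba, so the resulting ratio is not expected to match the 11.67x reported in the abstract.

```python
def kv_cache_bytes(cached_tokens_per_cache, num_kv_heads, head_dim, bytes_per_value=2):
    """Total bytes for keys and values, given the token count held by each stored cache."""
    per_token = 2 * num_kv_heads * head_dim * bytes_per_value  # key + value
    return per_token * sum(cached_tokens_per_cache)

# Hypothetical configuration, chosen only for illustration.
seq_len, window = 8192, 1024
num_layers, num_global = 32, 3      # 3 full-attention layers, the rest use a sliding window
num_kv_heads, head_dim = 8, 64

# Baseline: every layer keeps its own cache over the full sequence.
baseline = kv_cache_bytes([seq_len] * num_layers, num_kv_heads, head_dim)

# Compact: sliding-window layers cache at most `window` tokens,
# and each pair of them shares one cache (ceiling division for the odd one out).
num_local = num_layers - num_global
shared_local_caches = -(-num_local // 2)
compact = kv_cache_bytes(
    [seq_len] * num_global + [window] * shared_local_caches,
    num_kv_heads,
    head_dim,
)

reduction = baseline / compact
print(f"baseline: {baseline} B, compact: {compact} B, reduction: {reduction:.2f}x")
```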

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2411.13676

Country: North America > United States (0.46)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization

Huang, Xijie, Liu, Zechun, Liu, Shih-Yang, Cheng, Kwang-Ting

arXiv.org Artificial Intelligence, Jul-10-2024

Low-Rank Adaptation (LoRA), a representative Parameter-Efficient Fine-Tuning (PEFT) method, significantly enhances training efficiency by updating only a small portion of the weights in Large Language Models (LLMs). Recently, weight-only quantization techniques have also been applied to LoRA methods to reduce the memory footprint of fine-tuning. However, applying weight-activation quantization to the LoRA pipeline is under-explored, and we observe substantial performance degradation, primarily due to the presence of activation outliers. In this work, we propose RoLoRA, the first LoRA-based scheme for effective weight-activation quantization. RoLoRA utilizes rotation for outlier elimination and proposes rotation-aware fine-tuning to preserve the outlier-free characteristics in rotated LLMs. Experimental results show that RoLoRA consistently improves low-bit LoRA convergence and post-training quantization robustness in weight-activation settings. We evaluate RoLoRA on LLaMA2-7B/13B and LLaMA3-8B models, achieving up to a 29.5% absolute accuracy gain for 4-bit weight-activation quantized LLaMA2-13B on commonsense reasoning tasks compared to the LoRA baseline. We further demonstrate its effectiveness on Large Multimodal Models (LLaVA-1.5-7B). Code is available at https://github.com/HuangOwen/RoLoRA
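How a rotation can remove activation outliers without changing a layer's output can be shown with a small sketch. It assumes a normalized Hadamard matrix as the rotation and a single synthetic outlier channel; both are illustrative assumptions, and the code does not reproduce RoLoRA's rotation-aware fine-tuning.

```python
import numpy as np

def hadamard(n):
    """Normalized Hadamard matrix of size n (n must be a power of two); it is orthogonal."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

rng = np.random.default_rng(0)
dim = 64
x = rng.standard_normal((4, dim))   # activations
x[:, 5] = 50.0                      # synthetic outlier channel
w = rng.standard_normal((dim, 16))  # linear layer weight

r = hadamard(dim)
x_rot = x @ r        # rotated activations
w_rot = r.T @ w      # counter-rotated weight

# The layer output is unchanged because r @ r.T is the identity.
output_unchanged = np.allclose(x @ w, x_rot @ w_rot)

# The outlier's energy is spread across all channels, so the peak shrinks.
peak_before = np.abs(x).max()
peak_after = np.abs(x_rot).max()
print(output_unchanged, peak_before, peak_after)
```

A smaller peak relative to the typical value means a low-bit quantization grid wastes fewer levels on a single extreme channel.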

large language model, machine learning, quantization, (17 more...)

arXiv.org Artificial Intelligence

2407.08044

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)


DoRA: Weight-Decomposed Low-Rank Adaptation

Liu, Shih-Yang, Wang, Chien-Yi, Yin, Hongxu, Molchanov, Pavlo, Wang, Yu-Chiang Frank, Cheng, Kwang-Ting, Chen, Min-Hung

arXiv.org Artificial Intelligence, Jul-9-2024

Among the widely used parameter-efficient fine-tuning (PEFT) methods, LoRA and its variants have gained considerable popularity because they avoid additional inference costs. However, there often remains an accuracy gap between these methods and full fine-tuning (FT). In this work, we first introduce a novel weight decomposition analysis to investigate the inherent differences between FT and LoRA. Aiming to resemble the learning capacity of FT based on these findings, we propose Weight-Decomposed Low-Rank Adaptation (DoRA). DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning, specifically employing LoRA for directional updates to efficiently minimize the number of trainable parameters. By employing DoRA, we enhance both the learning capacity and training stability of LoRA while avoiding any additional inference overhead. DoRA consistently outperforms LoRA on fine-tuning LLaMA, LLaVA, and VL-BART on various downstream tasks, such as commonsense reasoning, visual instruction tuning, and image/video-text understanding. Code is available at https://github.com/NVlabs/DoRA.
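The magnitude and direction decomposition can be sketched in a few lines. The sketch assumes a column-wise norm and the usual LoRA initialization with a zero `b` factor; these details, the shapes, and the helper name `dora_weight` are assumptions that the abstract does not spell out.

```python
import numpy as np

def dora_weight(w0, magnitude, b, a):
    """Merged weight: magnitude * direction, where the direction is the
    column-normalized matrix w0 + b @ a (LoRA updates the direction only)."""
    v = w0 + b @ a
    col_norm = np.linalg.norm(v, axis=0, keepdims=True)
    return magnitude * v / col_norm

rng = np.random.default_rng(0)
d_out, d_in, rank = 16, 12, 4
w0 = rng.standard_normal((d_out, d_in))            # frozen pre-trained weight

# Trainable parts: one magnitude per column plus the two LoRA factors.
magnitude = np.linalg.norm(w0, axis=0, keepdims=True)
b = np.zeros((d_out, rank))                        # zero init, so training starts at w0
a = rng.standard_normal((rank, d_in)) * 0.01

w_init = dora_weight(w0, magnitude, b, a)

# After some hypothetical training step the direction changes,
# but each column's norm is still set by `magnitude` alone.
b_trained = rng.standard_normal((d_out, rank)) * 0.1
w_tuned = dora_weight(w0, magnitude, b_trained, a)
```

Since the result is a single dense matrix, it can be merged into the layer after training, which is consistent with the claim of no extra inference overhead.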

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2402.09353

Country: Europe > Austria > Vienna (0.14)

Genre: Research Report > New Finding (0.46)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Vision (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)


Genetic Quantization-Aware Approximation for Non-Linear Operations in Transformers

Dong, Pingcheng, Tan, Yonghao, Zhang, Dong, Ni, Tianwei, Liu, Xuejiao, Liu, Yu, Luo, Peng, Liang, Luhong, Liu, Shih-Yang, Huang, Xijie, Zhu, Huaiyu, Pan, Yun, An, Fengwei, Cheng, Kwang-Ting

arXiv.org Artificial Intelligence, Mar-29-2024

Non-linear functions are prevalent in Transformers and their lightweight variants, incurring substantial and frequently underestimated hardware costs. Previous state-of-the-art works optimize these operations by piece-wise linear approximation and store the parameters in look-up tables (LUT), but most of them require hardware-unfriendly high-precision arithmetic such as FP/INT 32 and lack consideration of integer-only INT quantization. This paper proposes a genetic LUT-approximation algorithm, namely GQA-LUT, that can automatically determine the parameters with quantization awareness.

Transformers benefit greatly from the self-attention mechanism, which captures long-range dependencies well but with a substantial overhead in computation and memory. Extensive research has been conducted to facilitate the deployment of Transformers on edge devices. Techniques like lightweight structures integrating convolution and linear attention [4, 5] have emerged, while quantization [6-8] and run-time pruning [9] have become favored approaches to further reduce the hardware burden. However, the optimization of non-linear operations is frequently neglected in Transformer-based models, which can be costly due to ...
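Piece-wise linear approximation with parameters held in a look-up table can be sketched as follows. The sketch uses GELU as the example non-linearity, hand-picked breakpoints, and floating-point arithmetic; it does not implement the genetic search or the integer-only quantization awareness of GQA-LUT, and all helper names are hypothetical.

```python
import math
from bisect import bisect_right

def gelu(x):
    """Exact GELU, used here as the non-linear function to approximate."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def build_lut(fn, breakpoints):
    """Return one (slope, intercept) pair per segment between consecutive breakpoints."""
    table = []
    for lo, hi in zip(breakpoints, breakpoints[1:]):
        slope = (fn(hi) - fn(lo)) / (hi - lo)
        table.append((slope, fn(lo) - slope * lo))
    return table

def lut_eval(x, breakpoints, table):
    """Look up the segment containing x and evaluate its line.
    Inputs outside the range reuse the first or last segment."""
    idx = bisect_right(breakpoints, x) - 1
    idx = min(max(idx, 0), len(table) - 1)
    slope, intercept = table[idx]
    return slope * x + intercept

# Hand-picked breakpoints (an assumption; a search would choose these automatically).
breakpoints = [-4.0, -2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 4.0]
table = build_lut(gelu, breakpoints)

max_err = max(
    abs(lut_eval(i / 100, breakpoints, table) - gelu(i / 100))
    for i in range(-400, 401)
)
print(f"8-entry LUT, max abs error on [-4, 4]: {max_err:.4f}")
```

The approximation quality depends heavily on where the breakpoints sit, which is the kind of parameter an automatic search is meant to determine.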

machine learning, natural language, quantization, (17 more...)

arXiv.org Artificial Intelligence

2403.19591

Country: North America > United States > California > San Francisco County > San Francisco (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)


Efficient Quantization-aware Training with Adaptive Coreset Selection

Huang, Xijie, Liu, Zechun, Liu, Shih-Yang, Cheng, Kwang-Ting

arXiv.org Artificial Intelligence, Sep-25-2023

The expanding model size and computation of deep neural networks (DNNs) have increased the demand for efficient model deployment methods. Quantization-aware training (QAT) is a representative model compression method that leverages redundancy in weights and activations. However, most existing QAT methods require end-to-end training on the entire dataset, which incurs long training times and high energy costs. Coreset selection, which aims to improve data efficiency by exploiting the redundancy of training data, has also been widely used for efficient training. In this work, we propose a new angle, coreset selection, to improve the training efficiency of quantization-aware training. Based on the characteristics of QAT, we propose two metrics, error vector score and disagreement score, to quantify the importance of each sample during training. Guided by these two importance metrics, we propose a quantization-aware adaptive coreset selection (ACS) method to select the data for the current training epoch. We evaluate our method on various networks (ResNet-18, MobileNetV2), datasets (CIFAR-100, ImageNet-1K), and under different quantization settings. Compared with previous coreset selection methods, our method significantly improves QAT performance with different dataset fractions. Our method achieves 68.39% accuracy with a 4-bit quantized ResNet-18 on the ImageNet-1K dataset using only a 10% subset, an absolute gain of 4.24% over the random baseline.
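Score-based coreset selection can be sketched as ranking samples and keeping the top fraction. The abstract names the two metrics but does not define them, so the definitions here are assumptions: the error vector score is taken as the L2 distance between the quantized model's prediction and the one-hot label, the disagreement score as the L2 distance between the quantized model's prediction and a full-precision teacher's, and the two are simply summed. The probabilities are made-up toy values.

```python
import math

def l2_distance(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def select_coreset(student_probs, teacher_probs, labels, fraction):
    """Return indices of the highest-scoring samples (assumed scoring, see lead-in)."""
    scored = []
    for idx, (p_s, p_t, y) in enumerate(zip(student_probs, teacher_probs, labels)):
        one_hot = [1.0 if c == y else 0.0 for c in range(len(p_s))]
        error_score = l2_distance(p_s, one_hot)    # how wrong the quantized model is
        disagreement = l2_distance(p_s, p_t)       # how far it is from the teacher
        scored.append((error_score + disagreement, idx))
    keep = max(1, round(fraction * len(scored)))
    scored.sort(reverse=True)
    return sorted(idx for _, idx in scored[:keep])

# Toy predictions for 4 samples and 2 classes (hypothetical values).
student_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.5, 0.5]]
teacher_probs = [[0.95, 0.05], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]
labels = [0, 0, 1, 1]

selected = select_coreset(student_probs, teacher_probs, labels, fraction=0.5)
print(selected)  # samples 1 and 3, where the quantized model is least certain
```

Recomputing the scores each epoch is what would make the selection adaptive, since the quantized model's predictions change as training proceeds.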

artificial intelligence, machine learning, selection, (16 more...)

arXiv.org Artificial Intelligence

2306.07215

Country: Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Oscillation-free Quantization for Low-bit Vision Transformers

Liu, Shih-Yang, Liu, Zechun, Cheng, Kwang-Ting

arXiv.org Artificial Intelligence, Jun-2-2023

Weight oscillation is an undesirable side effect of quantization-aware training, in which quantized weights frequently jump between two quantized levels, resulting in training instability and a sub-optimal final model. We discover that the learnable scaling factor, a widely used de facto setting in quantization, aggravates weight oscillation. In this study, we investigate the connection between the learnable scaling factor and quantized weight oscillation and use ViT as a case study to illustrate the findings and remedies. In addition, we find that the interdependence between quantized weights in the query and key of a self-attention layer makes ViT vulnerable to oscillation. We therefore propose three techniques: statistical weight quantization (StatsQ) to improve quantization robustness compared to the prevalent learnable-scale-based method; confidence-guided annealing (CGA), which freezes the weights with high confidence and calms the oscillating weights; and query-key reparameterization (QKR) to resolve the query-key intertwined oscillation and mitigate the resulting gradient misestimation. Extensive experiments demonstrate that these proposed techniques successfully abate weight oscillation and consistently achieve substantial accuracy improvement on ImageNet. Specifically, our 2-bit DeiT-T/DeiT-S algorithms outperform the previous state-of-the-art by 9.8% and 7.7%, respectively. Code and models are available at: https://github.com/nbasyl/OFQ.
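The oscillation phenomenon itself can be reproduced with a one-weight toy example. The sketch assumes a round-to-nearest quantizer with a fixed scale, a squared-error loss, and a straight-through gradient; it only illustrates the problem and does not implement StatsQ, CGA, or QKR.

```python
def simulate(w, target, lr, steps, scale=1.0):
    """Gradient descent on 0.5 * (q(w) - target)**2 with a straight-through estimator.
    Returns the latent weight after every step, including the start."""
    trajectory = [w]
    for _ in range(steps):
        q = round(w / scale) * scale   # quantized weight used in the forward pass
        grad = q - target              # gradient passed straight through the rounding
        w = w - lr * grad
        trajectory.append(w)
    return trajectory

def count_flips(latent_weights, scale=1.0):
    """Number of times the quantized level changes along a trajectory."""
    levels = [round(w / scale) for w in latent_weights]
    return sum(1 for prev, cur in zip(levels, levels[1:]) if prev != cur)

# Target halfway between two levels: the weight keeps crossing the rounding threshold.
oscillating = simulate(w=0.45, target=0.5, lr=0.2, steps=10)
# Target on a level: the weight crosses once and then settles.
stable = simulate(w=0.45, target=1.0, lr=0.2, steps=10)

print(count_flips(oscillating), count_flips(stable))
```

When the ideal value sits between two quantization levels, neither level gives a zero gradient, so the latent weight is pushed back and forth across the rounding threshold on every step.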

low-bit vision transformer, oscillation-free quantization

arXiv.org Artificial Intelligence

2302.0221

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Vision (0.40)
