Xue, Chun Jason
RAEE: A Training-Free Retrieval-Augmented Early Exiting Framework for Efficient Inference
Huang, Lianming, Wu, Shangyu, Cui, Yufei, Xiong, Ying, Liu, Xue, Kuo, Tei-Wei, Guan, Nan, Xue, Chun Jason
Deploying large language models for inference remains challenging due to their high computational overhead. Early exiting accelerates inference by adaptively reducing the number of layers executed. Existing methods train internal classifiers to decide whether to exit at each intermediate layer; however, such classifier-based early exiting frameworks require significant effort to design and train the classifiers. To address these limitations, this paper proposes RAEE, a training-free Retrieval-Augmented Early Exiting framework for efficient inference. First, the paper shows that early exiting can be modeled as a distribution prediction problem, where the distribution is approximated using the exiting information of similar data. Next, the paper details how this exiting information is collected to build the retrieval database. Finally, given the pre-built retrieval database, RAEE leverages the retrieved neighbors' exiting information to guide the backbone model to exit at the layer predicted by the approximated distribution. Experimental results demonstrate that RAEE significantly accelerates inference and achieves state-of-the-art zero-shot performance on eight classification tasks.
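To make the retrieval-augmented exiting idea concrete, below is a minimal sketch (not the authors' released code) of how a pre-built database of (embedding, exit-layer) pairs could be queried to approximate an exit-layer distribution; the database arrays, the value of k, and the similarity measure are assumptions for illustration.

    # Hypothetical sketch of retrieval-augmented exit-layer prediction.
    import numpy as np

    def predict_exit_layer(query_emb, db_embs, db_exit_layers, num_layers, k=8):
        """Approximate the exit-layer distribution from the k nearest neighbors."""
        # Cosine similarity between the query and every database embedding.
        sims = db_embs @ query_emb / (
            np.linalg.norm(db_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8
        )
        topk = np.argsort(-sims)[:k]
        # Histogram over layers, weighted by similarity to the query.
        dist = np.zeros(num_layers)
        for idx in topk:
            dist[db_exit_layers[idx]] += max(sims[idx], 0.0)
        dist /= dist.sum() + 1e-8
        return int(dist.argmax()), dist  # chosen exit layer, approximated distribution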
On the Compressibility of Quantized Large Language Models
Mao, Yu, Wang, Weilan, Du, Hongchao, Guan, Nan, Xue, Chun Jason
Deploying Large Language Models (LLMs) on edge or mobile devices offers significant benefits, such as enhanced data privacy and real-time processing. However, it also faces critical challenges due to the substantial memory requirements of LLMs. Quantization is an effective way to reduce model size while maintaining good performance. Even after quantization, however, an LLM may still be too large to fit entirely into the limited memory of an edge or mobile device and must be partially loaded from storage to complete inference. In this case, the I/O latency of model loading becomes the bottleneck of LLM inference latency. In this work, we take a preliminary step toward applying data compression techniques to reduce data movement and thereby speed up inference of quantized LLMs on memory-constrained devices. In particular, we discuss the compressibility of quantized LLMs, the trade-off between their compressibility and performance, and opportunities to optimize both jointly.
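As a rough illustration of what compressibility means here, the toy snippet below uniformly quantizes a weight tensor and measures how much a general-purpose compressor (zlib) shrinks it; the quantization scheme and compressor choice are assumptions for illustration, not the pipeline studied in the paper.

    # Illustrative only: compressibility of a uniformly quantized weight tensor.
    import zlib
    import numpy as np

    def compression_ratio(weights_fp32, levels=16):
        """Quantize to `levels` values, then compress the bytes with zlib."""
        w = np.asarray(weights_fp32, dtype=np.float32)
        lo, hi = w.min(), w.max()
        q = np.round((w - lo) / (hi - lo + 1e-8) * (levels - 1)).astype(np.uint8)
        raw = q.tobytes()
        packed = zlib.compress(raw, 9)
        return len(raw) / len(packed)

    # Weights clustered around zero typically compress better than near-uniform ones.
    rng = np.random.default_rng(0)
    print(compression_ratio(rng.normal(0, 0.02, 1_000_000)))
    print(compression_ratio(rng.uniform(-1, 1, 1_000_000)))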
Pre-processing matters: A segment search method for WSI classification
Wang, Jun, Cui, Yufei, Mao, Yu, Guan, Nan, Xue, Chun Jason
Pre-processing of whole slide images (WSIs) can affect classification performance in both the training and inference stages. Our study analyzes the impact of pre-processing parameters on inference and training across single- and multi-domain datasets. However, searching for an optimal parameter set is time-consuming. To overcome this, we propose a novel Similarity-based Simulated Annealing approach for fast parameter tuning that enhances inference performance on single-domain data, raising accuracy from 0.512 to 0.847. We further extend this insight to training performance on multi-domain data by employing Bayesian optimization to search for optimal pre-processing parameters, resulting in a high AUC of 0.967. We highlight that better WSI pre-processing can contribute to further accuracy improvements in histology.
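For intuition on the parameter search, here is a generic simulated-annealing loop over a dictionary of pre-processing parameters; the scoring and neighbor functions are hypothetical stand-ins, not the paper's similarity-based variant or its actual search space.

    # Generic simulated annealing over pre-processing parameters (sketch).
    import math
    import random

    def simulated_annealing(score_fn, init_params, neighbor_fn,
                            t0=1.0, cooling=0.95, steps=200):
        """Maximize score_fn over parameter dicts via simulated annealing."""
        current, best = dict(init_params), dict(init_params)
        cur_score = best_score = score_fn(current)
        t = t0
        for _ in range(steps):
            cand = neighbor_fn(current)
            cand_score = score_fn(cand)
            # Accept improvements always; accept worse moves with Boltzmann probability.
            if cand_score > cur_score or random.random() < math.exp((cand_score - cur_score) / t):
                current, cur_score = cand, cand_score
                if cur_score > best_score:
                    best, best_score = dict(current), cur_score
            t *= cooling
        return best, best_score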
Improving Natural Language Understanding with Computation-Efficient Retrieval Representation Fusion
Wu, Shangyu, Xiong, Ying, Cui, Yufei, Liu, Xue, Tang, Buzhou, Kuo, Tei-Wei, Xue, Chun Jason
Retrieval-based augmentations that incorporate knowledge from an external database into language models have achieved great success in various knowledge-intensive (KI) tasks, such as question answering and text generation. However, integrating retrievals into non-knowledge-intensive (NKI) tasks, such as text classification, remains challenging. Existing works concatenate retrievals to the input as context, forming prompt-based inputs. Unfortunately, such methods require language models capable of handling long texts, and inference over the concatenated inputs consumes a significant amount of computational resources. To address these challenges, we propose ReFusion, a computation-efficient Retrieval representation Fusion framework with neural architecture search. The main idea is to fuse the retrieval representations directly into the language model. Specifically, we first propose an online retrieval module that retrieves representations of similar sentences. Then, we present a retrieval fusion module with two effective ranking schemes, a reranker-based scheme and an ordered-mask-based scheme, to fuse the retrieval representations with the hidden states. Furthermore, we use Neural Architecture Search (NAS) to find the optimal fusion structure across layers. Finally, comprehensive experiments demonstrate that ReFusion achieves superior and robust performance on various NKI tasks.
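A minimal sketch of the fusion idea, assuming a reranker-style weighting over k retrieved sentence representations that is added to a layer's hidden states; module names, shapes, and the residual-add are assumptions for illustration, not the authors' implementation.

    # Sketch: fuse retrieved representations into a layer's hidden states.
    import torch
    import torch.nn as nn

    class RetrievalFusion(nn.Module):
        def __init__(self, hidden_dim):
            super().__init__()
            self.scorer = nn.Linear(hidden_dim, 1)   # scores (reranks) the k retrievals
            self.proj = nn.Linear(hidden_dim, hidden_dim)

        def forward(self, hidden, retrieved):
            # hidden:    (batch, seq_len, hidden_dim) from the backbone layer
            # retrieved: (batch, k, hidden_dim) representations of similar sentences
            weights = torch.softmax(self.scorer(retrieved), dim=1)   # (batch, k, 1)
            fused = (weights * retrieved).sum(dim=1, keepdim=True)   # (batch, 1, hidden_dim)
            # Add the fused retrieval signal to every token's hidden state.
            return hidden + self.proj(fused)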
Accelerating Monte Carlo Bayesian Inference via Approximating Predictive Uncertainty over Simplex
Cui, Yufei, Yao, Wuguannan, Li, Qiao, Chan, Antoni B., Xue, Chun Jason
Estimating the uncertainty of a Bayesian model has been investigated for decades. The model posterior is almost always intractable, so approximation is necessary. In many real-world cases, even when a decent estimate of the model posterior is obtained, another approximation is required to compute the predictive distribution over the desired output. A common and accurate solution is Monte Carlo (MC) integration, but it must maintain a large number of samples, evaluate the model repeatedly, and average multiple model outputs. In this paper, we propose a method to approximate the probability distribution over the simplex induced by the model posterior, enabling tractable computation of the predictive distribution for classification. The aim is to approximate the induced uncertainty of a specific Bayesian model while alleviating the heavy workload of MC integration at test time. Methodologically, we adapt the Wasserstein distance to learn the induced conditional distributions, which is novel for Bayesian learning. The proposed method is universally applicable to Bayesian classification models that allow posterior sampling. Empirical results validate the strong practical performance of our approach.
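The MC integration that the proposed method seeks to avoid at test time can be summarized as below; sample_posterior and logits_fn are hypothetical placeholders for a posterior weight sampler and the network forward pass.

    # Baseline: Monte Carlo estimate of the predictive distribution by averaging
    # softmax outputs over posterior weight samples.
    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def mc_predictive(x, sample_posterior, logits_fn, num_samples=100):
        """p(y|x) ~= (1/S) * sum_s softmax(f(x; w_s)), with w_s drawn from the posterior."""
        probs = [softmax(logits_fn(x, sample_posterior())) for _ in range(num_samples)]
        return np.mean(probs, axis=0)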