Lee, Deokjae


Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment

Lee, Deokjae, Song, Hyun Oh

arXiv.org Artificial Intelligence

We study weight-only post-training quantization (PTQ), which quantizes the weights of a large language model (LLM) without retraining, using little or no calibration data. Weight-only PTQ is crucial for reducing the memory footprint and latency of LLM inference, especially in memory-bound, small-batch inference scenarios, such as personalized inference on edge devices. Despite its importance, irregular weight distributions with heavy-tailed outliers in LLMs complicate quantization, recently motivating rotation-based methods that transform weights into near-Gaussian distributions, which are more regular with fewer outliers, thereby reducing quantization error. In this work, we first derive the information-theoretically optimal bit allocation for Gaussianized weights under given bit budgets, revealing that fine-grained fractional-bit quantizers approaching the Gaussian distortion-rate bound are essential to achieve near-optimal quantization performance. To bridge this theoretical insight and practical implementation, we introduce Q-Palette, a versatile collection of fractional-bit quantizers that range from trellis-coded quantizers offering near-optimal distortion to simpler vector and scalar quantizers optimized for faster inference, all efficiently implemented with optimized CUDA kernels across various bitwidths. Furthermore, leveraging Q-Palette as a foundational component, we propose a novel mixed-scheme quantization framework, jointly optimizing quantizer choices and layer fusion decisions given resource constraints. The code is available at https://github.com/snu-mllab/Q-Palette.
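The optimal allocation referenced here has a classical closed form for independent Gaussian sources: minimizing the total distortion Σ_i σ_i² 2^(−2b_i) under a budget Σ_i b_i = B gives b_i = B/n + ½ log₂(σ_i² / geometric-mean variance), which is generally fractional. A minimal sketch of that rate-distortion argument (the per-layer variances and budget are invented; this is not Q-Palette itself):

```python
import math

def optimal_bit_allocation(variances, total_bits):
    """Closed-form allocation minimizing sum_i var_i * 2**(-2*b_i) subject to
    sum_i b_i == total_bits (per-source Gaussian distortion-rate bound;
    the non-negativity constraint on b_i is ignored for simplicity)."""
    n = len(variances)
    log_gm = sum(math.log2(v) for v in variances) / n  # log2 of geometric mean
    return [total_bits / n + 0.5 * (math.log2(v) - log_gm) for v in variances]

def distortion(variances, bits):
    return sum(v * 2.0 ** (-2.0 * b) for v, b in zip(variances, bits))

variances = [4.0, 1.0, 0.25]   # hypothetical per-layer variances after rotation
budget = 9.0                   # total bit budget across the three layers
bits = optimal_bit_allocation(variances, budget)      # -> [4.0, 3.0, 2.0]
uniform = [budget / len(variances)] * len(variances)  # 3 bits everywhere
assert distortion(variances, bits) <= distortion(variances, uniform)
```

At the optimum every layer contributes the same distortion (here 2⁻⁶ each), the equal-distortion property of reverse water-filling; realizing such fractional bitwidths in hardware is what the quantizer collection targets.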


GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance

Kim, Jinuk, Halabi, Marwa El, Park, Wonpyo, Schaefer, Clemens JS, Lee, Deokjae, Park, Yeonhong, Lee, Jae W., Song, Hyun Oh

arXiv.org Artificial Intelligence

Post-training quantization is a key technique for reducing the memory and inference latency of large language models by quantizing weights and activations without requiring retraining. However, existing methods either (1) fail to account for the varying importance of hidden features to the end loss or, when incorporating end loss, (2) neglect the critical interactions between model weights. To address these limitations, we propose GuidedQuant, a novel quantization approach that integrates gradient information from the end loss into the quantization objective while preserving cross-weight dependencies within output channels. GuidedQuant consistently boosts the performance of state-of-the-art quantization methods across weight-only scalar, weight-only vector, and weight-and-activation quantization. Additionally, we introduce a novel non-uniform scalar quantization algorithm, which is guaranteed to monotonically decrease the quantization objective value, and outperforms existing methods in this category. We release the code at https://github.com/snu-mllab/GuidedQuant.
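The monotone-decrease guarantee claimed for the non-uniform scalar quantizer is the property enjoyed by importance-weighted Lloyd iterations: the assignment step and the weighted-centroid step are each exact minimizers given the other, so the objective can only go down. A toy sketch of that generic mechanism (the values and saliency weights are invented; this stand-in is not the GuidedQuant algorithm):

```python
def weighted_lloyd(values, saliency, centroids, iters=20):
    """Non-uniform scalar quantization: minimize sum_i s_i * (v_i - c_{a(i)})**2.
    Assignment and saliency-weighted centroid updates alternate; each step is an
    exact minimizer given the other, so the objective decreases monotonically."""
    history = []
    for _ in range(iters):
        # Assignment: nearest centroid also minimizes s_i*(v_i - c)^2 since s_i > 0.
        assign = [min(range(len(centroids)), key=lambda k: (v - centroids[k]) ** 2)
                  for v in values]
        # Update: saliency-weighted mean of each cluster (skip empty clusters).
        for k in range(len(centroids)):
            den = sum(s for s, a in zip(saliency, assign) if a == k)
            if den > 0:
                centroids[k] = sum(s * v for v, s, a in zip(values, saliency, assign)
                                   if a == k) / den
        history.append(sum(s * (v - centroids[a]) ** 2
                           for v, s, a in zip(values, saliency, assign)))
    return centroids, history

values = [-1.2, -0.9, -0.1, 0.05, 0.8, 1.1]  # toy weight values
saliency = [1.0, 1.0, 5.0, 5.0, 1.0, 1.0]    # invented end-loss importance
cents, hist = weighted_lloyd(values, saliency, centroids=[-1.0, 0.0, 1.0])
assert all(b <= a + 1e-12 for a, b in zip(hist, hist[1:]))  # never increases
```

The saliency weights pull centroids toward values the end loss cares about, which is the intuition behind weighting quantization error by gradient information.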


Training Greedy Policy for Proposal Batch Selection in Expensive Multi-Objective Combinatorial Optimization

Lee, Deokjae, Song, Hyun Oh, Cho, Kyunghyun

arXiv.org Artificial Intelligence

Active learning is increasingly adopted for expensive multi-objective combinatorial optimization problems, but it involves a challenging subset selection problem, optimizing the batch acquisition score that quantifies the goodness of a batch for evaluation. Due to the excessively large search space of the subset selection problem, prior methods optimize the batch acquisition on the latent space, which has discrepancies with the actual space, or optimize individual acquisition scores without considering the dependencies among candidates in a batch instead of directly optimizing the batch acquisition.

These problems focus on identifying designs, represented as discrete objects like strings or graphs, that optimize multiple attributes, often requiring substantial resources for accurate assessment (Ehrgott, 2005; Gómez-Bombarelli et al., 2016; Stanton et al., 2022; Winter et al., 2019; Mirhoseini et al., 2021). Active learning frameworks, which iteratively propose candidates and learn from the attributes evaluated on those candidates, are increasingly employed in these fields due to their query efficiency, which is a critical component to handling expensive evaluation costs (Aggarwal et al., 2014; Jain et al., 2022; Gruver et al., 2023; Zhu et al., 2023; Agnesina et al., 2023). In active learning, each round entails an internal problem of selecting a proposal batch of candidates for querying, formulated by cardinality-constrained
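Directly optimizing a batch acquisition score can be illustrated with plain greedy subset selection: each step adds the candidate with the largest marginal gain, which accounts for redundancy among candidates already in the batch. A toy sketch (the coverage-style acquisition, candidate names, and weights are all invented stand-ins for the paper's learned batch acquisition):

```python
def batch_acquisition(batch, coverage, weights):
    """Toy batch score: total weight of distinct objectives the batch covers."""
    covered = set().union(*(coverage[c] for c in batch)) if batch else set()
    return sum(weights[m] for m in covered)

def greedy_batch(pool, coverage, weights, batch_size):
    """Greedily add the candidate with the largest marginal gain in batch score."""
    batch = []
    for _ in range(batch_size):
        best = max((c for c in pool if c not in batch),
                   key=lambda c: batch_acquisition(batch + [c], coverage, weights))
        batch.append(best)
    return batch

coverage = {"a": {1, 2}, "b": {2, 3}, "c": {4}, "d": {1, 2, 3}}  # invented
weights = {1: 1.0, 2: 1.0, 3: 1.0, 4: 2.0}                       # invented
batch = greedy_batch(list(coverage), coverage, weights, batch_size=2)
# Picks d (gain 3.0) then c (gain 2.0): redundancy with d rules out a and b.
```

Scoring the batch jointly is what distinguishes this from ranking candidates by individual acquisition scores, which would happily select the mutually redundant a, b, and d.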


Efficient Latency-Aware CNN Depth Compression via Two-Stage Dynamic Programming

Kim, Jinuk, Jeong, Yeonwoo, Lee, Deokjae, Song, Hyun Oh

arXiv.org Artificial Intelligence

Recent works on neural network pruning advocate that reducing the depth of the network is more effective in reducing run-time memory usage and accelerating inference latency than reducing the width of the network through channel pruning. In this regard, some recent works propose depth compression algorithms that merge convolution layers. However, the existing algorithms have a constricted search space and rely on human-engineered heuristics. In this paper, we propose a novel depth compression algorithm that targets general convolution operations. We propose a subset selection problem that replaces inefficient activation layers with identity functions and optimally merges consecutive convolution operations into shallow equivalent convolution operations for efficient end-to-end inference latency. Since the proposed subset selection problem is NP-hard, we formulate a surrogate optimization problem that can be solved exactly via two-stage dynamic programming within a few seconds. We evaluate our methods and baselines with TensorRT for a fair inference latency comparison. Our method outperforms the baseline method with higher accuracy and faster inference speed in MobileNetV2 on the ImageNet dataset. Specifically, we achieve $1.41\times$ speed-up with $0.11$\%p accuracy gain in MobileNetV2-1.0 on ImageNet.
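The exact dynamic program over merge decisions can be illustrated with a one-stage simplification: if latency(i, j) denotes the measured latency of the single convolution obtained by merging layers i..j (with the activations between them replaced by identities), the best partition of an n-layer block satisfies dp[j] = min_i dp[i−1] + latency(i, j). A sketch with an invented latency table (the paper's actual formulation is a two-stage DP with additional constraints):

```python
def optimal_merge(latency, num_layers):
    """Interval DP over layer partitions: latency[(i, j)] is the hypothetical
    measured latency of one conv replacing layers i..j.  dp[j] holds the best
    achievable latency for layers 1..j; choice[j] records the last segment."""
    INF = float("inf")
    dp = [0.0] + [INF] * num_layers
    choice = [0] * (num_layers + 1)
    for j in range(1, num_layers + 1):
        for i in range(1, j + 1):          # last segment merges layers i..j
            cand = dp[i - 1] + latency[(i, j)]
            if cand < dp[j]:
                dp[j], choice[j] = cand, i
    segments, j = [], num_layers           # recover the chosen segments
    while j > 0:
        segments.append((choice[j], j))
        j = choice[j] - 1
    return dp[num_layers], segments[::-1]

# Invented table for a 3-layer block: merging layers 1-2 is profitable,
# merging all three is not (the merged kernel grows too large).
latency = {(1, 1): 5.0, (2, 2): 5.0, (3, 3): 4.0,
           (1, 2): 7.0, (2, 3): 9.5, (1, 3): 12.0}
best, segs = optimal_merge(latency, 3)     # -> 11.0, [(1, 2), (3, 3)]
```

Because the table is indexed by measured latencies rather than proxy metrics like FLOPs, the DP directly optimizes the quantity reported at evaluation time.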


Query-Efficient Black-Box Red Teaming via Bayesian Optimization

Lee, Deokjae, Lee, JunYeong, Ha, Jung-Woo, Kim, Jin-Hwa, Lee, Sang-Woo, Lee, Hwaran, Song, Hyun Oh

arXiv.org Artificial Intelligence

The deployment of large-scale generative models is often restricted by their potential risk of causing harm to users in unpredictable ways. We focus on the problem of black-box red teaming, where a red team generates test cases and interacts with the victim model to discover a diverse set of failures with limited query access. Existing red teaming methods construct test cases based on human supervision or a language model (LM) and query all test cases in a brute-force manner without incorporating any information from past evaluations, resulting in a prohibitively large number of queries. To this end, we propose Bayesian red teaming (BRT), novel query-efficient black-box red teaming methods based on Bayesian optimization, which iteratively identify diverse positive test cases leading to model failures by utilizing a pre-defined user input pool and past evaluations. Experimental results on various user input pools demonstrate that our method consistently finds a significantly larger number of diverse positive test cases under a limited query budget than the baseline methods. The source code is available at https://github.com/snu-mllab/Bayesian-Red-Teaming.
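The query-efficient loop can be sketched as pool-based Bayesian optimization: fit a surrogate to past (test case, score) pairs, then query the unevaluated candidate maximizing an upper-confidence-bound acquisition. The sketch below uses a simple kernel-regression surrogate on a 1-D toy pool as a stand-in for BRT's surrogate over text inputs; every name and number is invented:

```python
import math

def rbf(x, y, ell=1.0):
    return math.exp(-((x - y) ** 2) / (2.0 * ell ** 2))

def surrogate(x, observed):
    """Kernel-regression stand-in for a GP: smoothed mean of past scores plus a
    distance-based uncertainty (1 minus similarity to the closest past query)."""
    if not observed:
        return 0.0, 1.0
    wts = [rbf(x, xo) for xo, _ in observed]
    mean = sum(w * yo for w, (_, yo) in zip(wts, observed)) / sum(wts)
    return mean, 1.0 - max(wts)

def bayes_opt_pool(pool, score_fn, budget, beta=2.0):
    """Each round, query the unevaluated candidate maximizing the UCB
    acquisition mean + beta * uncertainty, then record its true score."""
    observed = []
    for _ in range(budget):
        seen = {xo for xo, _ in observed}
        def ucb(c):
            m, s = surrogate(c, observed)
            return m + beta * s
        x = max((c for c in pool if c not in seen), key=ucb)
        observed.append((x, score_fn(x)))
    return observed

pool = [i / 10 for i in range(40)]    # toy 1-D stand-in for a user input pool
score = lambda x: -(x - 2.3) ** 2     # hidden objective, peak at x = 2.3
obs = bayes_opt_pool(pool, score, budget=12)
best_x = max(obs, key=lambda t: t[1])[0]
```

Because every evaluation feeds back into the surrogate, the loop concentrates queries near high-scoring regions instead of brute-forcing the pool, which is the source of the query savings.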