compactor
Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores
Chari, Vivek, Van Durme, Benjamin
Modern Large Language Models (LLMs) are increasingly trained to support very large context windows. Unfortunately the ability to use long contexts in generation is complicated by the large memory requirement of the KV cache, which scales linearly with the context length. This memory footprint is often the dominant resource bottleneck in real-world deployments, limiting throughput and increasing serving costs. One way to address this is by compressing the KV cache, which can be done either with knowledge of the question being asked (query-aware) or without knowledge of the query (query-agnostic). We present Compactor, a training-free, query-agnostic KV compression strategy that uses approximate leverage scores to determine token importance. We show that Compactor can achieve the same performance as competing methods while retaining 20% fewer tokens in both synthetic and real-world context tasks, while being far more task-robust. We further introduce a procedure for context-calibrated compression: inferring the maximum compression a given context supports before significant performance loss. Using context-calibrated compression, we show that Compactor achieves full KV performance on Longbench while reducing the KV memory burden by 68%, on average. To demonstrate the efficacy and generalizability of our approach, we apply Compactor to 27 synthetic and real-world tasks from RULER and Longbench, with models from both the Qwen 2.5 and Llama 3.1 families.
Weight-Inherited Distillation for Task-Agnostic BERT Compression
Wu, Taiqiang, Hou, Cheng, Zhao, Zhe, Lao, Shanshan, Li, Jiayi, Wong, Ngai, Yang, Yujiu
Knowledge Distillation (KD) is a predominant approach for BERT compression. Previous KD-based methods focus on designing extra alignment losses for the student model to mimic the behavior of the teacher model. These methods transfer the knowledge in an indirect way. In this paper, we propose a novel Weight-Inherited Distillation (WID), which directly transfers knowledge from the teacher. WID does not require any additional alignment loss and trains a compact student by inheriting the weights, showing a new perspective of knowledge distillation. Specifically, we design the row compactors and column compactors as mappings and then compress the weights via structural re-parameterization. Experimental results on the GLUE and SQuAD benchmarks show that WID outperforms previous state-of-the-art KD-based baselines. Further analysis indicates that WID can also learn the attention patterns from the teacher model without any alignment loss on attention distributions.
Learned Interpolation for Better Streaming Quantile Approximation with Worst-Case Guarantees
Schiefer, Nicholas, Chen, Justin Y., Indyk, Piotr, Narayanan, Shyam, Silwal, Sandeep, Wagner, Tal
An $\varepsilon$-approximate quantile sketch over a stream of $n$ inputs approximates the rank of any query point $q$ - that is, the number of input points less than $q$ - up to an additive error of $\varepsilon n$, generally with some probability of at least $1 - 1/\mathrm{poly}(n)$, while consuming $o(n)$ space. While the celebrated KLL sketch of Karnin, Lang, and Liberty achieves a provably optimal quantile approximation algorithm over worst-case streams, the approximations it achieves in practice are often far from optimal. Indeed, the most commonly used technique in practice is Dunning's t-digest, which often achieves much better approximations than KLL on real-world data but is known to have arbitrarily large errors in the worst case. We apply interpolation techniques to the streaming quantiles problem to attempt to achieve better approximations on real-world data sets than KLL while maintaining similar guarantees in the worst case.
Lossless CNN Channel Pruning via Gradient Resetting and Convolutional Re-parameterization
Ding, Xiaohan, Hao, Tianxiang, Liu, Ji, Han, Jungong, Guo, Yuchen, Ding, Guiguang
However, as CNN's representational capacity depends Inspired by the neurobiology research about the independence on the width of conv layers, it is difficult to reduce the of remembering and forgetting, we propose to width without performance drops. On practical CNN architectures re-parameterize a CNN into the remembering parts and forgetting like ResNet-50 [16] and large-scale datasets like parts, where the former learn to maintain the performance ImageNet [6], lossless pruning with high compression ratio and the latter learn for efficiency. By training the has long been considered challenging. For reasonable tradeoff re-parameterized model using regular SGD on the former between compression ratio and performance, a typical but a novel update rule with penalty gradients on the latter, paradigm (Figure 1.A) [2, 3, 9, 30, 33, 56, 57] seeks to train we realize structured sparsity, enabling us to equivalently the model with magnitude-related penalty loss (e.g., group convert the re-parameterized model into the original architecture Lasso [51, 54]) on the conv kernels to produce structured with narrower layers.
Discrepancy, Coresets, and Sketches in Machine Learning
This paper defines the notion of class discrepancy for families of functions. It shows that low discrepancy classes admit small offline and streaming coresets. We provide general techniques for bounding the class discrepancy of machine learning problems. As corollaries of the general technique we bound the discrepancy (and therefore coreset complexity) of logistic regression, sigmoid activation loss, matrix covariance, kernel density and any analytic function of the dot product or the squared distance. Our results prove the existence of epsilon-approximation O(sqrt{d}/epsilon) sized coresets for the above problems. This resolves the long-standing open problem regarding the coreset complexity of Gaussian kernel density estimation. We provide two more related but independent results. First, an exponential improvement of the widely used merge-and-reduce trick which gives improved streaming sketches for any low discrepancy problem. Second, an extremely simple deterministic algorithm for finding low discrepancy sequences (and therefore coresets) for any positive semi-definite kernel. This paper establishes some explicit connections between class discrepancy, coreset complexity, learnability, and streaming algorithms.