kurtosis
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.92)
- Information Technology > Sensing and Signal Processing > Image Processing (0.67)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)
Adaptive Layer-Wise Transformations for Post-Training Quantization of Large Language Models
Pham, Cuong, Dung, Hoang Anh, Nguyen, Cuong C., Le, Trung, Carneiro, Gustavo, Cai, Jianfei, Do, Thanh-Toan
Large language models require significant computational resources for deployment, making quantization essential for practical applications. However, the main obstacle to effective quantization lies in systematic outliers in activations and weights, which cause substantial LLM performance degradation, especially at low-bit settings. While existing transformation-based methods like affine and rotation transformations successfully mitigate outliers, they adopt a homogeneous transformation setting, i.e., the same transformation type across all layers, ignoring the heterogeneous distribution characteristics within LLMs. In this paper, we propose an adaptive transformation selection framework that systematically determines optimal transformations on a per-layer basis. To this end, we first formulate transformation selection as a differentiable optimization problem to identify the most accurate transformation type for each layer. However, searching for optimal layer-wise transformations for every model is computationally expensive. To address this, we establish a connection between weight-distribution kurtosis and the best-performing transformation type. Specifically, we propose an outlier-guided layer selection method using robust $z$-score normalization that achieves comparable performance to differentiable search with significantly reduced overhead. Comprehensive experiments on LLaMA family models demonstrate that our adaptive approach consistently outperforms the widely-used fixed transformation settings. For example, our method achieves an improvement of up to 4.58 perplexity points and a 2.11% gain in average six-task zero-shot accuracy under aggressive W3A3K2V2 quantization settings for the LLaMA-3-8B model compared to the current best existing method, FlatQuant, demonstrating the necessity of heterogeneous transformation selection for optimal LLM quantization.
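The outlier-guided selection the abstract describes — kurtosis of each layer's weights, normalized with a robust (median/MAD) $z$-score — can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `threshold` value and the `"affine"`/`"rotation"` labels are illustrative placeholders for the paper's transformation types.

```python
import numpy as np

def excess_kurtosis(w):
    """Excess kurtosis of a flattened weight tensor (0 for a Gaussian)."""
    w = np.asarray(w, dtype=np.float64).ravel()
    c = w - w.mean()
    return (c ** 4).mean() / (c ** 2).mean() ** 2 - 3.0

def robust_z_scores(values):
    """Median/MAD z-scores; 0.6745 rescales MAD to match sigma for Gaussians."""
    v = np.asarray(values, dtype=np.float64)
    med = np.median(v)
    mad = np.median(np.abs(v - med))
    return 0.6745 * (v - med) / (mad + 1e-12)

def select_transforms(layer_weights, threshold=3.0):
    """Assign outlier-heavy (high-kurtosis) layers a stronger transformation."""
    z = robust_z_scores([excess_kurtosis(w) for w in layer_weights])
    return ["rotation" if zi > threshold else "affine" for zi in z]
```

On synthetic weights, a heavy-tailed layer among otherwise Gaussian layers stands out with a large robust $z$-score and is flagged for the stronger transform.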
A Scale Free Algorithm for Stochastic Bandits with Bounded Kurtosis
Existing strategies for finite-armed stochastic bandits mostly depend on a parameter of scale that must be known in advance. Sometimes this is in the form of a bound on the payoffs, or the knowledge of a variance or subgaussian parameter. The notable exceptions are the analysis of Gaussian bandits with unknown mean and variance by Cowan et al. [2015] and of uniform distributions with unknown support [Cowan and Katehakis, 2015]. The results derived in these specialised cases are generalised here to the non-parametric setup, where the learner knows only a bound on the kurtosis of the noise, which is a scale free measure of the extremity of outliers.
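The key property the abstract relies on — kurtosis as a scale-free measure of outlier extremity — is easy to verify empirically. The sketch below uses the standard (non-excess) definition $\kappa = \mathbb{E}[(X-\mu)^4]/\operatorname{Var}(X)^2$; it is a demonstration of the scale-invariance property only, not of the bandit algorithm itself.

```python
import numpy as np

def empirical_kurtosis(x):
    """kappa = E[(X - mu)^4] / Var(X)^2 (equals 3 for a Gaussian).

    The fourth central moment divided by the squared variance cancels any
    rescaling of x, so kurtosis measures tail weight independent of scale."""
    x = np.asarray(x, dtype=np.float64)
    c = x - x.mean()
    return (c ** 4).mean() / (c ** 2).mean() ** 2
```

Multiplying the payoffs by any nonzero constant (e.g. changing units of reward) leaves the kurtosis unchanged, which is why a bound on it does not constitute a known scale parameter.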
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Data Science > Data Mining > Big Data (0.47)
3948ead63a9f2944218de038d8934305-AuthorFeedback.pdf
"The authors provided strong reasoning behind why a uniform shape is beneficial"; "The paper is easy to follow"; "Authors did enough experiments on different data sets and different neural networks". Below we address the main suggestions for improvements. "There is little explanation about the impact of Kurtosis on the activation quantization." "...a solution that can easily modify the step size to become a power of two would be very desirable." "Is there any particular reason for choosing Kurtosis over other statistical measures, such as the coefficient of variation?" "In Table 1, it can be observed that from 4-bit quantization to 3-bit quantization, the performance drops a lot." "No experimental parameter settings are provided, and no comprehensive comparison with the latest SOTA method..." "I don't get the claim of the title of this paper, 'One model to rule them all'" -- We store a single set of weights ("one ..."). In contrast, we allow for a single model to operate at various quantization levels (e.g., employ a 4-bit variant of the ...). "Second, the comparison between KURE and the baseline model could be biased in Table 1."
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Energy (0.46)
- Education > Educational Setting (0.45)
Gradient-Weight Alignment as a Train-Time Proxy for Generalization in Classification Tasks
Hölzl, Florian A., Rueckert, Daniel, Kaissis, Georgios
Robust validation metrics remain essential in contemporary deep learning, not only to detect overfitting and poor generalization, but also to monitor training dynamics. In the supervised classification setting, we investigate whether interactions between training data and model weights can yield such a metric that both tracks generalization during training and attributes performance to individual training samples. We introduce Gradient-Weight Alignment (GWA), quantifying the coherence between per-sample gradients and model weights. We show that effective learning corresponds to coherent alignment, while misalignment indicates deteriorating generalization. GWA is efficiently computable during training and reflects both sample-specific contributions and dataset-wide learning dynamics. Extensive experiments show that GWA accurately predicts optimal early stopping, enables principled model comparisons, and identifies influential training samples, providing a validation-set-free approach for model analysis directly from the training data.
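The abstract defines GWA as the coherence between per-sample gradients and model weights but does not give the formula. The sketch below is a hypothetical cosine-similarity instantiation on a logistic-regression model; the function names and the averaging choice are assumptions for illustration, not the authors' exact definition.

```python
import numpy as np

def per_sample_gradients(W, X, y):
    """Per-sample gradients of the logistic loss w.r.t. weights W.
    Returns one gradient row (p_i - y_i) * x_i per training sample."""
    p = 1.0 / (1.0 + np.exp(-(X @ W)))      # predicted probabilities
    return (p - y)[:, None] * X              # shape (n_samples, n_features)

def gradient_weight_alignment(W, X, y):
    """Mean cosine similarity between each per-sample gradient and W.
    High coherence across samples indicates that the training data still
    drive the weights in a consistent direction; misalignment would show
    up as cosines scattered around zero."""
    G = per_sample_gradients(W, X, y)
    num = G @ W
    den = np.linalg.norm(G, axis=1) * np.linalg.norm(W) + 1e-12
    return float(np.mean(num / den))
```

Because it is computed from training-set gradients alone, such a quantity requires no held-out validation split, matching the abstract's "validation-set-free" framing.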
Towards Fully FP8 GEMM LLM Training at Scale
Hernández-Cano, Alejandro, Garbaya, Dhia, Schlag, Imanol, Jaggi, Martin
Despite the significant potential of FP8 data formats for large language model (LLM) pre-training, their adoption has been limited due to challenges in maintaining stability at scale. Existing approaches often rely on suboptimal fine-grained FP8 kernels or fall back to higher-precision matrix multiplications (GEMMs) in sensitive components, such as attention projections, compromising potential throughput gains. We introduce a new class of LLM architectures that, for the first time, support FP8 computation for all GEMMs within transformer blocks during both forward and backward passes. This enables unprecedented throughput gains, particularly at scale, while matching the downstream performance of standard BF16 training. Our architecture design reduces large outlier activations, promoting stable long-term FP8 training. In addition, we identify key metrics to monitor low-precision training and predict potential future divergences.
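To make concrete why outlier activations threaten FP8 training, the sketch below simulates rounding to the E4M3 format (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, max normal value 448) used in FP8 GEMMs. It is a simplified model for illustration — NaN handling is omitted and saturation at ±448 is an assumed policy, not necessarily what the paper's kernels do.

```python
import math

def fp8_e4m3(x):
    """Round a float to the nearest FP8 E4M3 value (simplified simulation).

    With only 3 mantissa bits there are 8 representable steps per binade,
    so large-magnitude outliers force coarse steps (32 apart near 448),
    which is why reducing outlier activations stabilizes FP8 training."""
    if x == 0.0:
        return 0.0
    s = math.copysign(1.0, x)
    a = abs(x)
    if a > 448.0:
        return s * 448.0               # saturate instead of overflowing
    e = math.floor(math.log2(a))
    e = max(e, -6)                     # subnormals share the minimum exponent
    step = 2.0 ** (e - 3)              # 3 mantissa bits -> 8 steps per binade
    return s * round(a / step) * step  # round-to-nearest (ties to even)
```

For example, 3.3 lands on 3.25 (the nearest representable value), while anything beyond 448 collapses to the saturation value, illustrating the narrow dynamic range that outliers exhaust.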
Thought Anchors: Which LLM Reasoning Steps Matter?
Bogdan, Paul C., Macar, Uzay, Nanda, Neel, Conmy, Arthur
Current frontier large language models rely on reasoning to achieve state-of-the-art performance. Many existing interpretability methods are limited in this area, as standard methods have been designed to study single forward passes of a model rather than the multi-token computational steps that unfold during reasoning. We argue that analyzing reasoning traces at the sentence level is a promising approach to understanding reasoning processes. We introduce a black-box method that measures each sentence's counterfactual importance by repeatedly sampling replacement sentences from the model, filtering for semantically different ones, and continuing the chain of thought from that point onwards to quantify the sentence's impact on the distribution of final answers. We discover that certain sentences can have an outsized impact on the trajectory of the reasoning trace and final answer. We term these sentences \textit{thought anchors}. These are generally planning or uncertainty management sentences, and specialized attention heads consistently attend from subsequent sentences to thought anchors. We further show that examining sentence-sentence causal links within a reasoning trace gives insight into a model's behavior. Such information can be used to predict a problem's difficulty and the extent to which different question domains involve sequential or diffuse reasoning. As a proof-of-concept, we demonstrate that our techniques together provide a practical toolkit for analyzing reasoning models by conducting a detailed case study of how the model solves a difficult math problem, finding that our techniques yield a consistent picture of the reasoning trace's structure. We provide an open-source tool (thought-anchors.com) for visualizing the outputs of our methods on further problems. The convergence across our methods shows the potential of sentence-level analysis for a deeper understanding of reasoning models.
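The black-box resampling procedure can be sketched in a toy form. Here `sample_answer` and `resample_sentence` are hypothetical stand-ins for LLM sampling calls, exact-match filtering is a crude substitute for the paper's semantic filtering, and total-variation distance is one assumed choice for comparing the resulting answer distributions.

```python
from collections import Counter

def counterfactual_importance(sentences, i, sample_answer, resample_sentence, n=8):
    """Estimate sentence i's impact on the final-answer distribution.

    Samples n answers continuing from the original sentence and n answers
    continuing from resampled (different) replacement sentences, then
    returns the total-variation distance between the two answer counts."""
    prefix = list(sentences[:i])
    base = Counter(sample_answer(prefix + [sentences[i]]) for _ in range(n))
    cf = Counter()
    for _ in range(n):
        s = resample_sentence(prefix)
        while s == sentences[i]:       # crude stand-in for semantic filtering
            s = resample_sentence(prefix)
        cf[sample_answer(prefix + [s])] += 1
    return 0.5 * sum(abs(base[k] - cf[k]) / n for k in set(base) | set(cf))
```

A sentence whose removal flips the answer distribution entirely scores 1.0 (a candidate thought anchor), while an interchangeable sentence scores near 0.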
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.93)