Goto

Collaborating Authors

 ttn


Attention: Fast and Accurate Sparse Attention Inference by Delta Correction

Neural Information Processing Systems

The attention mechanism of a transformer has a quadratic complexity, leading to high inference costs and latency for long sequences. However, attention matrices are mostly sparse, which implies that many entries may be omitted from computation for efficient inference. Sparse attention inference methods aim to reduce this computational burden; however, they also come with a troublesome performance degradation. We discover that one reason for this degradation is that the sparse calculation induces a distributional shift in the attention outputs. The distributional shift causes decoding-time queries to fail to align well with the appropriate keys from the prefill stage, leading to a drop in performance. We propose a simple, novel, and effective procedure for correcting this distributional shift, bringing the distribution of sparse attention outputs closer to that of quadratic attention. Our method can be applied on top of any sparse attention method, and results in an average 36%pt performance increase, recovering 88% of quadratic attention accuracy on the 131KRULER benchmark when applied on top of sliding window attention with sink tokens while only adding a small overhead. Our method can maintain approximately 98.5% sparsity over full quadratic attention, making our model 32 times faster than Flash Attention 2 when processing 1M token prefills.


Test Time Adaptation Using Adaptive Quantile Recalibration

arXiv.org Artificial Intelligence

Domain adaptation is a key strategy for enhancing the generalizability of deep learning models in real-world scenarios, where test distributions often diverge significantly from the training domain. However, conventional approaches typically rely on prior knowledge of the target domain or require model retraining, limiting their practicality in dynamic or resource-constrained environments. Recent test-time adaptation methods based on batch normalization statistic updates allow for unsupervised adaptation, but they often fail to capture complex activation distributions and are constrained to specific normalization layers. We propose Adaptive Quantile Recalibration (AQR), a test-time adaptation technique that modifies pre-activation distributions by aligning quantiles on a channel-wise basis. AQR captures the full shape of activation distributions and generalizes across architectures employing BatchNorm, GroupNorm, or LayerNorm. To address the challenge of estimating distribution tails under varying batch sizes, AQR incorporates a robust tail calibration strategy that improves stability and precision. Our method leverages source-domain statistics computed at training time, enabling unsupervised adaptation without retraining models. Experiments on CIFAR-10-C, CIFAR-100-C, and ImageNet-C across multiple architectures demonstrate that AQR achieves robust adaptation across diverse settings, outperforming existing test-time adaptation baselines. These results highlight AQR's potential for deployment in real-world scenarios with dynamic and unpredictable data distributions.


Riemannian Optimization on Tree Tensor Networks with Application in Machine Learning

arXiv.org Artificial Intelligence

Tree tensor networks (TTNs) are widely used in low-rank approximation and quantum many-body simulation. In this work, we present a formal analysis of the differential geometry underlying TTNs. Building on this foundation, we develop efficient first- and second-order optimization algorithms that exploit the intrinsic quotient structure of TTNs. Additionally, we devise a backpropagation algorithm for training TTNs in a kernel learning setting. We validate our methods through numerical experiments on a representative machine learning task.


APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs

arXiv.org Artificial Intelligence

While long-context inference is crucial for advancing large language model (LLM) applications, its prefill speed remains a significant bottleneck. Current approaches, including sequence parallelism strategies and compute reduction through approximate attention mechanisms, still fall short of delivering optimal inference efficiency. This hinders scaling the inputs to longer sequences and processing long-context queries in a timely manner. To address this, we introduce APB, an efficient long-context inference framework that leverages multi-host approximate attention to enhance prefill speed by reducing compute and enhancing parallelism simultaneously. APB introduces a communication mechanism for essential key-value pairs within a sequence parallelism framework, enabling a faster inference speed while maintaining task performance. We implement APB by incorporating a tailored FlashAttn kernel alongside optimized distribution strategies, supporting diverse models and parallelism configurations. APB achieves speedups of up to 9.2x, 4.2x, and 1.6x compared with FlashAttn, RingAttn, and StarAttn, respectively, without any observable task performance degradation. We provide the implementation and experiment code of APB in https://github.com/thunlp/APB.


On Mechanistic Circuits for Extractive Question-Answering

arXiv.org Artificial Intelligence

Large language models are increasingly used to process documents and facilitate question-answering on them. In our paper, we extract mechanistic circuits for this real-world language modeling task: context-augmented language modeling for extractive question-answering (QA) tasks and understand the potential benefits of circuits towards downstream applications such as data attribution to context information. We extract circuits as a function of internal model components (e.g., attention heads, MLPs) using causal mediation analysis techniques. Leveraging the extracted circuits, we first understand the interplay between the model's usage of parametric memory and retrieved context towards a better mechanistic understanding of context-augmented language models. We then identify a small set of attention heads in our circuit which performs reliable data attribution by default, thereby obtaining attribution for free in just the model's forward pass. Using this insight, we then introduce ATTNATTRIB, a fast data attribution algorithm which obtains state-of-the-art attribution results across various extractive QA benchmarks. Finally, we show the possibility to steer the language model towards answering from the context, instead of the parametric memory by using the attribution from ATTNATTRIB as an additional signal during the forward pass. Beyond mechanistic understanding, our paper provides tangible applications of circuits in the form of reliable data attribution and model steering.


Ultra-low latency quantum-inspired machine learning predictors implemented on FPGA

arXiv.org Artificial Intelligence

Tensor Networks (TNs) are a computational paradigm used for representing quantum many-body systems. Recent works have shown how TNs can also be applied to perform Machine Learning (ML) tasks, yielding comparable results to standard supervised learning techniques. In this work, we study the use of Tree Tensor Networks (TTNs) in high-frequency real-time applications by exploiting the low-latency hardware of the Field-Programmable Gate Array (FPGA) technology. We present different implementations of TTN classifiers, capable of performing inference on classical ML datasets as well as on complex physics data. A preparatory analysis of bond dimensions and weight quantization is realized in the training phase, together with entanglement entropy and correlation measurements, that help setting the choice of the TTN architecture. The generated TTNs are then deployed on a hardware accelerator; using an FPGA integrated into a server, the inference of the TTN is completely offloaded. Eventually, a classifier for High Energy Physics (HEP) applications is implemented and executed fully pipelined with sub-microsecond latency.


Classical versus Quantum: comparing Tensor Network-based Quantum Circuits on LHC data

arXiv.org Artificial Intelligence

Tensor Networks (TN) are approximations of high-dimensional tensors designed to represent locally entangled quantum many-body systems efficiently. This study provides a comprehensive comparison between classical TNs and TN-inspired quantum circuits in the context of Machine Learning on highly complex, simulated LHC data. We show that classical TNs require exponentially large bond dimensions and higher Hilbert-space mapping to perform comparably to their quantum counterparts. While such an expansion in the dimensionality allows better performance, we observe that, with increased dimensionality, classical TNs lead to a highly flat loss landscape, rendering the usage of gradient-based optimization methods highly challenging. Furthermore, by employing quantitative metrics, such as the Fisher information and effective dimensions, we show that classical TNs require a more extensive training sample to represent the data as efficiently as TN-inspired quantum circuits. We also engage with the idea of hybrid classical-quantum TNs and show possible architectures to employ a larger phase-space from the data. We offer our results using three main TN ansatz: Tree Tensor Networks, Matrix Product States, and Multi-scale Entanglement Renormalisation Ansatz.


QTN-VQC: An End-to-End Learning framework for Quantum Neural Networks

arXiv.org Artificial Intelligence

The state-of-the-art machine learning (ML), particularly based on deep neural networks (DNN), has enabled a wide spectrum of successful applications ranging from the everyday deployment of speech recognition [1] and computer vision [2] through to the frontier of scientific research in synthetic biology [8]. Despite rapid theoretical and empirical progress in DNN based regression and classification [9], DNN training algorithms are computationally expensive for many new scientific applications, such as new drug discovery [10], which requires computational resources that are beyond the computational limits of classical hardwares [11]. Fortunately, the imminent advent of quantum computing devices opens up new possibilities of exploiting quantum machine learning (QML) [12, 13, 14, 15, 16, 17] to improve the computational efficiency of ML algorithms in the new scientific domains. Although the exploitation of quantum computing devices to carry out QML is still in its initial exploratory stages, the rapid development in quantum hardware has motivated advances in quantum neural networks (QNN) to run in noisy intermediate-scale quantum (NISQ) devices [18, 19, 20, 21]. A NISQ device means that not enough qubits could be spared for quantum error correction, and the imperfect qubits have to be directly used at the physical layer.


Quantum-inspired Machine Learning on high-energy physics data

arXiv.org Machine Learning

Tensor Networks, a numerical tool originally designed for simulating quantum many-body systems, have recently been applied to solve Machine Learning problems. Exploiting a tree tensor network, we apply a quantum-inspired machine learning technique to a very important and challenging big data problem in high energy physics: the analysis and classification of data produced by the Large Hadron Collider at CERN. In particular, we present how to effectively classify so-called b-jets, jets originating from b-quarks from proton-proton collisions in the LHCb experiment, and how to interpret the classification results. We exploit the Tensor Network approach to select important features and adapt the network geometry based on information acquired in the learning process. Finally, we show how to adapt the tree tensor network to achieve optimal precision or fast response in time without the need of repeating the learning process. These results pave the way to the implementation of high-frequency real-time applications, a key ingredient needed among others for current and future LHCb event classification able to trigger events at the tens of MHz scale.


Machine Learning for Metal Casting - Hype or Opportunity?

#artificialintelligence

Let's set the hype and anti-hype of machine learning aside and discuss the opportunities it can provide to the field of metal casting. But what challenges must be overcome first? The Machine Learning (ML) hype is real. The term is appearing everywhere. You hear phrases like "AI is the new electricity."