 approximation error


Degrees of Freedom for Linear Attention: Distilling Softmax Attention with Optimal Feature Efficiency

Neural Information Processing Systems

Linear attention has attracted interest as a computationally efficient approximation to softmax attention, especially for long sequences. Recent studies have explored distilling softmax attention in pre-trained Transformers into linear attention. However, a critical challenge remains: how to choose the feature dimension that governs the approximation quality. Existing methods fix this dimension uniformly across all attention layers, overlooking the layers' diverse roles and complexities. In this paper, we propose a principled method to automatically determine the feature dimension in linear attention using the concept of statistical degrees of freedom, which represent the effective dimensionality of the inputs. We provide a theoretical bound on the approximation error and show that the dimension chosen by our method achieves smaller errors under a fixed computational budget. Furthermore, we introduce an efficient layerwise training strategy to learn nonlinear features tailored to each layer. Experiments on multiple pre-trained Transformers demonstrate that our method improves the performance of distilled models compared to baselines without increasing the inference cost. Our findings also provide insight into how the complexity of the attention mechanism evolves across layers.
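To make the role of the feature dimension concrete, here is a minimal sketch of softmax attention next to a generic linear attention. It is not the method of the paper above: the feature map `phi` (elu(x) + 1) is an assumed, commonly used choice, the attention is non-causal, and all function names are illustrative. In this sketch the feature dimension equals the length of the vector returned by `phi`.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def phi(x):
    # Assumed feature map: elu(x) + 1, which keeps every feature positive.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def softmax_attention(Q, K, V):
    # Quadratic in sequence length: every query scores every key.
    d_v = len(V[0])
    out = []
    for q in Q:
        scores = [dot(q, k) for k in K]
        m = max(scores)  # subtract the max for numerical stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) / z for c in range(d_v)])
    return out

def linear_attention(Q, K, V):
    # Linear in sequence length: keys and values are summarized once.
    fk = [phi(k) for k in K]
    r, d_v = len(fk[0]), len(V[0])
    # S[a][c] = sum_j phi(k_j)[a] * v_j[c]
    S = [[sum(f[a] * v[c] for f, v in zip(fk, V)) for c in range(d_v)] for a in range(r)]
    # z[a] = sum_j phi(k_j)[a]
    z = [sum(f[a] for f in fk) for a in range(r)]
    out = []
    for q in Q:
        fq = phi(q)
        denom = dot(fq, z)  # positive because all features are positive
        out.append([sum(fq[a] * S[a][c] for a in range(r)) / denom for c in range(d_v)])
    return out
```

Both functions return a weighted average of the rows of `V`; they differ in how the weights are formed, and the summary `S` grows with the feature dimension, which is why that dimension trades accuracy against cost.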


Stability and Oracle Inequalities for Optimal Transport Maps between General Distributions

Neural Information Processing Systems

Optimal transport (OT) provides a powerful framework for comparing and transforming probability distributions, with wide applications in generative modeling, AI4Science and statistical inference. However, existing estimation theory typically requires stringent smoothness conditions on the underlying Brenier potentials and assumes bounded distribution supports, limiting practical applicability. In this paper, we introduce a unified theoretical framework for semi-dual OT map estimation that relaxes both of these restrictions. Building on the sieved convex conjugate, our framework has two key contributions: (i) a new map stability bound that holds without any second-order regularity assumptions on the true Brenier potentials, and (ii) an oracle inequality that cleanly decomposes the estimation error into statistical error, sieved bias, and approximation error. Specifically, our approximation error is measured in the $L^\infty$ norm rather than the Sobolev norm used in existing results, aligning more naturally with classical approximation theory. Leveraging these tools, we bound the statistical error of semi-dual estimators under mild and verifiable conditions on the true OT map. Moreover, we establish the first theoretical guarantee for a deep neural network OT map estimator between general distributions, with the Tanh network function class as an example.


p-value Adjustment for Monotonous, Unbiased, and Fast Clustering Comparison

Neural Information Processing Systems

Popular metrics for clustering comparison, like the Adjusted Rand Index and the Adjusted Mutual Information, are type II biased. The Standardized Mutual Information removes this bias but suffers from counterintuitive non-monotonicity and poor computational efficiency. We introduce the p-value adjusted Rand Index (PMI2), the first cluster comparison method that is type II unbiased and provably monotonous. The PMI2 has fast approximations that outperform the Standardized Mutual Information. We demonstrate its unbiased clustering selection, approximation quality, and runtime efficiency on synthetic benchmarks. In experiments on image and social network datasets, we show how the PMI2 can help practitioners choose better clustering and community detection algorithms.
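For reference, the baseline Adjusted Rand Index named above can be computed from the contingency table of two labelings. The sketch below covers the standard ARI only; it does not implement the PMI2 or its approximations, whose definitions are not given in this abstract. The function name `adjusted_rand_index` and the handling of degenerate inputs (returning 1.0) are illustrative choices.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Standard Adjusted Rand Index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: treat as perfect agreement

    # Contingency table entries and its row and column sums.
    cell_counts = Counter(zip(labels_a, labels_b))
    a_counts = Counter(labels_a)
    b_counts = Counter(labels_b)

    # Pairs of items grouped together in both, in A, and in B respectively.
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_a = sum(comb(c, 2) for c in a_counts.values())
    sum_b = sum(comb(c, 2) for c in b_counts.values())

    expected = sum_a * sum_b / total_pairs  # expectation under random permutation
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (sum_cells - expected) / (max_index - expected)
```

The index depends only on the partitions, not on the label names, so `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` is treated as perfect agreement.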


Continual Release Moment Estimation with Differential Privacy

Neural Information Processing Systems

We propose Joint Moment Estimation (JME), a method for continually and privately estimating both the first and second moments of a data stream with reduced noise compared to naive approaches. JME supports the matrix mechanism and exploits a joint sensitivity analysis to identify a privacy regime in which the second-moment estimation incurs no additional privacy cost, thereby improving accuracy while maintaining privacy. We demonstrate JME's effectiveness in two applications: estimating the running mean and covariance matrix for Gaussian density estimation and model training with DP-Adam.


Fast Rank-1 Lattice Targeted Sampling for Black-box Optimization

Neural Information Processing Systems

Black-box optimization has gained great attention for its success in recent applications. However, scaling up to high-dimensional problems with good query efficiency remains challenging. This paper proposes a novel Rank-1 Lattice Targeted Sampling (RLTS) technique to address this issue. Our RLTS benefits from random rank-1 lattice Quasi-Monte Carlo, which enables us to perform fast local exact Gaussian processes (GP) training and inference with $O(n \log n)$ complexity w.r.t.



Adaptivity Under Realizability Constraints: Comparing In-Context and Agentic Learning

arXiv.org Machine Learning

We compare in-context learning with fixed queries and agentic learning with adaptive queries for uniform approximation of task families. We consider two settings: an unrestricted regime, where querying and approximation are arbitrary functions, and a realizable regime, where we require these operations to be implemented by ReLU neural networks. In both settings, adaptivity never hinders approximation performance. However, this advantage can change when one passes from the unrestricted regime to the realizable regime. We identify four distinct approximation scenarios, each witnessed by an explicit task family: (a) no advantage of adaptivity; (b) an advantage in the unrestricted regime that persists under ReLU realizability; (c) an advantage that arises only under realizability; and (d) an advantage that disappears under realizability. This demonstrates that representational constraints interact profoundly with the effect of adaptivity.


A Biased Graph Neural Network Sampler with Near-Optimal Regret

Neural Information Processing Systems

Graph neural networks (GNNs) have recently emerged as a vehicle for applying deep network architectures to graph and relational data. However, given the increasing size of industrial datasets, in many practical situations the message passing computations required for sharing information across GNN layers are no longer scalable. Although various sampling methods have been introduced to approximate full-graph training within a tractable budget, there remain unresolved complications such as high variances and limited theoretical guarantees. To address these issues, we build upon existing work and treat GNN neighbor sampling as a multi-armed bandit problem, but with a newly designed reward function that introduces some degree of bias in order to reduce variance and avoid unstable, possibly unbounded payouts. Unlike prior bandit-GNN use cases, the resulting policy leads to near-optimal regret while accounting for the GNN training dynamics introduced by SGD.