AITopics

2607.00479

Country: Europe (0.28)

Genre: Research Report (0.63)

Industry: Information Technology (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.86)

Liu, Peilin, Zhou, Ding-Xuan

Generalization Analysis of Transformers in Distribution Regression

arXiv.org Machine Learning, Jun-30-2026

In recent years, models based on the Transformer architecture have seen widespread applications and have become one of the core tools in the field of deep learning. Numerous successful techniques, such as parameter-efficient fine-tuning and efficient scaling, have been proposed surrounding their applications to further enhance performance. However, the success of these strategies has always lacked the support of rigorous mathematical theory. To study the underlying mechanisms behind Transformers and related techniques, we first propose a Transformer learning framework motivated by distribution regression, with distributions being inputs, connect a two-stage sampling process with natural language processing, and present a mathematical formulation of the attention mechanism called attention operator. We demonstrate that by the attention operator, Transformers can compress distributions into function representations without loss of information. Moreover, with the advantages of our novel attention operator, Transformers exhibit a stronger capability to learn functionals with more complex structures than convolutional neural networks and fully connected networks. Finally, we obtain a generalization bound within the distribution regression framework. Through the aforementioned theoretical results, we further discuss some successful techniques emerging with large language models (LLMs), such as prompt tuning, parameter-efficient fine-tuning, and efficient scaling. We also provide theoretical insights behind these techniques within our novel analysis framework.

attention operator, large language model, machine learning, (18 more...)

doi: 10.1162/neco_a_01726

2606.29256

Country: North America > United States (0.28)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Machine Learning, Jun-26-2026

Learning Probabilistic Filters with Strictly Proper Scoring Rules

Bach, Eviatar, Baptista, Ricardo, Bröcker, Jochen, Chen, Bohan, Stuart, Andrew

Bayesian filtering of partially and noisily observed dynamical systems seeks to infer the evolving conditional distribution of the state of a dynamical system, given observations, in an online fashion. This Bayesian filtering distribution is the natural object for uncertainty quantification, but it is rarely available as a supervised learning target. However, one can often use the forecast model to generate synthetic system trajectories, along with synthetic observations. We introduce the proper scoring ensemble filter (PSEF), an ensemble data assimilation method based on training an analysis map to approximate the filtering distribution using only synthetic state--observation trajectories. The analysis step is represented as a permutation-invariant, transformer-based map that takes as input a forecast ensemble and observations, producing an analysis ensemble. Training is based on strictly proper scoring rules -- with the energy score used in our implementation -- so that probabilistic accuracy is rewarded over the whole probability distribution. We prove that, under a realizability assumption, the population objective is minimized by the true Bayesian filtering distribution. We also derive the finite-ensemble empirical objective used in training and relate its single state--observation trajectory form to the population objective, using a mean-field consistency argument. Numerical experiments show that the learned filter accurately approximates challenging filtering distributions, including nonlinear, non-Gaussian, and multi-modal posteriors, and achieves stronger performance in data assimilation tasks than classical methods or learning-based methods with mean-squared-error objectives. For close-to-Gaussian problems, learning a correction to the EnKF is the best approach, while for highly non-Gaussian problems an end-to-end approach that discards this inductive bias is superior.
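As a rough illustration of the scoring rule named in the abstract above, the sketch below computes one common finite-ensemble estimate of the energy score, the mean distance from ensemble members to the verifying state minus half the mean pairwise distance between members. The function name `energy_score` and the choice of this particular estimator (which includes the zero self-pairs) are assumptions for illustration; the paper's exact training objective is not given in the abstract.

```python
import math


def energy_score(ensemble, truth):
    """Estimate the energy score of an ensemble against one verifying state.

    ensemble: list of equal-length coordinate tuples (the analysis ensemble).
    truth: coordinate tuple of the verifying state.
    Lower values indicate a better probabilistic forecast.
    """
    n = len(ensemble)
    # Accuracy term: average distance from each member to the truth.
    accuracy = sum(math.dist(x, truth) for x in ensemble) / n
    # Spread term: average distance over all ordered pairs, self-pairs included.
    spread = sum(math.dist(a, b) for a in ensemble for b in ensemble) / (n * n)
    return accuracy - 0.5 * spread


# Hypothetical two-member ensembles in 2-D, scored against the same truth.
centred = [(0.0, 0.0), (2.0, 0.0)]
shifted = [(4.0, 0.0), (6.0, 0.0)]
truth = (1.0, 0.0)
print(energy_score(centred, truth), energy_score(shifted, truth))
```

The spread term is what distinguishes this from a mean-error objective: an ensemble that collapses onto a single point gets no credit for representing uncertainty.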

artificial intelligence, data mining, machine learning, (18 more...)

2606.26497

Country:

North America > United States (1.00)
Europe (0.67)

Genre: Research Report > New Finding (0.65)

Industry:

Leisure & Entertainment > Games (0.63)
Government > Regional Government (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.92)

arXiv.org Machine Learning, Jun-26-2026

Beyond Global Divergences: A Local-Mass Perspective on Bayesian Inference

Xu, Hanli, He, Fengxiang, Moka, Sarat

Global objectives, such as KL divergence and ELBO, are widely used in Bayesian inference for measuring distributional discrepancy. This paper studies their local-mass behaviour that is not directly captured by such objectives. We introduce and use two mathematical tools: (1) Mass Index for recording the polynomial and logarithmic decay scales of local mass, and (2) regularised extended KL (RE-KL), a set-localised divergence that can be formulated in the presence of singular components. Mass Indices help characterise how Bayesian updating changes local mass: (1) power-log likelihood factors shift it explicitly, and (2) parameter-dependent supports, or their smooth softenings, may change the local scale through the amount of mass that remains near the parameter value. Using local RE-KL, we prove absolute, relative, and directional inequalities for comparing local small-ball masses under the two KL directions. Together, these results provide a local theoretical account of local mass behaviour. Experiments provide controlled illustrations of the local behaviour. Code is available at https://github.com/Forsythia0604/Local-Mass-Framework.
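To make "the two KL directions" concrete, here is a minimal sketch on a three-point space. It shows only the standard global KL divergence, not the paper's Mass Index or RE-KL, and the distributions `target`, `narrow`, and `broad` are made-up illustrative values.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # the convention 0 * log(0 / q) = 0
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


# Hypothetical bimodal target and two candidate approximations.
target = [0.495, 0.01, 0.495]
narrow = [0.98, 0.01, 0.01]   # concentrates on one mode
broad = [1 / 3, 1 / 3, 1 / 3]  # covers both modes and the valley between them

print("KL(target || narrow):", kl(target, narrow))
print("KL(target || broad): ", kl(target, broad))
print("KL(narrow || target):", kl(narrow, target))
print("KL(broad || target): ", kl(broad, target))
```

With these values the direction KL(target || q) favours the broad approximation, while KL(q || target) favours the narrow one, so the two global objectives disagree about which candidate is closer to the target.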

artificial intelligence, machine learning, mipow, (16 more...)

2606.2709

Genre: Research Report > Experimental Study (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Neural Information Processing Systems, Jun-22-2026, 22:32:54 GMT

Private Evolution Converges

Private Evolution (PE) is a promising training-free method for differentially private (DP) synthetic data generation. While it achieves strong performance in some domains (e.g., images and text), its behavior in others (e.g., tabular data) is less consistent. To date, the only theoretical analysis of the convergence of PE depends on unrealistic assumptions about both the algorithm's behavior and the structure of the sensitive dataset. In this work, we develop a new theoretical framework to understand PE's practical behavior and identify sufficient conditions for its convergence. For $d$-dimensional sensitive datasets with $n$ data points from a convex and compact domain, we prove that under the right hyperparameter settings and given access to the Gaussian variation API proposed in [33], PE produces an $(\varepsilon,\delta)$-DP synthetic dataset with expected 1-Wasserstein distance $O(d(n\varepsilon)^{-1/d})$ from the original; this establishes worst-case convergence of the algorithm as $n \to \infty$. Our analysis extends to general Banach spaces as well. We also connect PE to the Private Signed Measure Mechanism, a method for DP synthetic data generation that has thus far not seen much practical adoption. We demonstrate the practical relevance of our theoretical findings in experiments.

artificial intelligence, machine learning, natural language, (20 more...)

Country: North America > Canada (0.28)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Data Science (0.92)

Neural Information Processing Systems, Jun-22-2026, 18:59:49 GMT

Integral Imprecise Probability Metrics

Quantifying differences between probability distributions is fundamental to statistics and machine learning, primarily for comparing statistical uncertainty. In contrast, epistemic uncertainty--due to incomplete knowledge--requires richer representations than those offered by classical probability. Imprecise probability (IP) theory offers such models, capturing ambiguity and partial belief. This has driven growing interest in imprecise probabilistic machine learning (IPML), where inference and decision-making rely on broader uncertainty models--highlighting the need for metrics beyond classical probability. This work introduces the Integral Imprecise Probability Metric (IIPM) framework, a Choquet integral-based generalisation of classical Integral Probability Metrics (IPMs) to the setting of capacities--a broad class of IP models encompassing many existing ones, including lower probabilities, probability intervals, belief functions, and more. Theoretically, we establish conditions under which IIPM serves as a valid metric and metrises a form of weak convergence of capacities. Practically, IIPM not only enables comparison across different IP models but also supports the quantification of epistemic uncertainty (EU) within a single IP model. In particular, by comparing an IP model with its conjugate, IIPM gives rise to a new class of EU measures--Maximum Mean Imprecisions (MMIs)--which satisfy key axiomatic properties proposed in the uncertainty quantification literature. We validate MMI through selective classification experiments, demonstrating strong empirical performance against established EU measures, and outperforming them when classical methods struggle to scale to a large number of classes.
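The Choquet integral that the abstract above builds on can be sketched for a finite space as follows. This illustrates only the integral itself for a nonnegative function, not the IIPM or MMI constructions. The names `choquet_integral`, `probability`, and `vacuous` are hypothetical, and the capacity is assumed to be monotone with value 0 on the empty set and 1 on the full set.

```python
def choquet_integral(f, capacity):
    """Choquet integral of a nonnegative function on a finite set.

    f: dict mapping each element to a nonnegative value.
    capacity: callable taking a frozenset of elements and returning a number
        in [0, 1]; assumed monotone with capacity(frozenset()) == 0.
    """
    elements = sorted(f, key=f.get)  # ascending by function value
    total = 0.0
    previous = 0.0
    for i, x in enumerate(elements):
        upper_set = frozenset(elements[i:])  # elements whose value is >= f[x]
        total += (f[x] - previous) * capacity(upper_set)
        previous = f[x]
    return total


f = {"a": 1, "b": 3, "c": 2}
p = {"a": 0.2, "b": 0.5, "c": 0.3}


def probability(subset):
    """An additive capacity: an ordinary probability measure."""
    return sum(p[x] for x in subset)


def vacuous(subset):
    """Vacuous lower probability: certain only about the full set."""
    return 1.0 if len(subset) == len(f) else 0.0


print(choquet_integral(f, probability))  # agrees with the expectation of f under p
print(choquet_integral(f, vacuous))      # agrees with the minimum of f
```

For an additive capacity the integral reduces to the ordinary expectation, and for the vacuous lower probability it reduces to the worst-case value, which is the sense in which capacities generalise classical probability.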

artificial intelligence, information management, machine learning, (18 more...)

Country:

North America > United States (0.45)
Europe > United Kingdom (0.27)

Genre:

Research Report > Experimental Study (1.00)
Overview (1.00)
Research Report > New Finding (0.66)

Industry: Information Technology > Security & Privacy (0.45)

Technology:

Information Technology > Information Management (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

arXiv.org Machine Learning, Jun-18-2026

Wasserstein Policy Learning for Distributional Outcomes

Huang, Yiyan, Leung, Cheuk Hang, Wu, Qi, Zhang, Zhiheng

Offline policy learning has received growing attention in causal inference. The primary objective is to learn a policy (individualized treatment rule) as a mapping from covariates to treatment that maximizes the empirical welfare defined as the mean of scalar-valued potential outcomes. In this paper, we study offline policy learning with distribution-valued outcomes, where each potential outcome is a probability measure on $\mathbb{R}$ and the reward is defined through a utility functional applied to the Wasserstein barycenter of induced outcome distributions. We establish statistical guarantees for the policy learning framework based on both Inverse Probability Weighting (IPW) and Doubly Robust (DR) estimators. By handling the challenging uniform deviation over the product of the combinatorial policy class and the infinite-dimensional quantile domain, we prove that the finite-sample regret has leading dependence $\widetilde{\mathcal{O}}(\sqrt{\mathrm{N\text{-}dim}(\Pi)/N})$. In the one-dimensional Wasserstein setting and under the stated regularity conditions, the leading regret rate is still governed by the policy-class complexity. Moreover, we provide a minimax lower bound establishing the sharpness of the leading dependence on $N$ and $\mathrm{N\text{-}dim}(\Pi)$.
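For distributions on the real line, the 2-Wasserstein barycenter has a quantile function equal to the weighted average of the input quantile functions, which is one reason a quantile domain appears in this setting. The sketch below applies that fact to equal-size samples by averaging order statistics. The function name `barycenter_1d` and the restriction to equal-size empirical samples are assumptions for illustration, and the abstract does not state which Wasserstein order or estimator the paper uses.

```python
def barycenter_1d(samples, weights=None):
    """Empirical 2-Wasserstein barycenter of 1-D samples of equal size.

    samples: list of equal-length lists of floats, one list per distribution.
    weights: optional nonnegative weights summing to 1 (uniform by default).
    Returns the sorted support points of the barycenter.
    """
    k = len(samples)
    n = len(samples[0])
    if any(len(s) != n for s in samples):
        raise ValueError("all samples must have the same length")
    if weights is None:
        weights = [1.0 / k] * k
    # Sorting each sample gives its empirical quantile function on a common grid.
    ordered = [sorted(s) for s in samples]
    # Average the i-th order statistics across distributions.
    return [sum(w * s[i] for w, s in zip(weights, ordered)) for i in range(n)]


# Two hypothetical outcome samples; the barycenter lies between them.
print(barycenter_1d([[0.0, 1.0, 2.0], [10.0, 12.0, 11.0]]))
```

A utility functional, such as a mean or a quantile, could then be evaluated on the returned support points.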

artificial intelligence, machine learning, probability, (15 more...)

2606.19117

Country: Asia > China > Guangdong Province (0.28)

Genre: Research Report (0.40)

Industry:

Government (0.92)
Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Neural Information Processing Systems, Jun-17-2026, 00:30:24 GMT

Collective Counterfactual Explanations: Balancing Individual Goals and Collective Dynamics

Counterfactual explanations provide individuals with cost-optimal recommendations to achieve their desired outcomes. However, when a significant number of individuals seek similar state modifications, this individual-centric approach can inadvertently create competition and introduce unforeseen costs. Additionally, disregarding the underlying data distribution may lead to recommendations that individuals perceive as unusual or impractical. To address these challenges, we propose a novel framework that extends standard counterfactual explanations by incorporating a population dynamics model. This framework penalizes deviations from equilibrium after individuals follow the recommendations, effectively mitigating externalities caused by correlated changes across the population. By balancing individual modification costs with their impact on others, our method ensures more equitable and efficient outcomes. We show how this approach reframes the counterfactual explanation problem from an individual-centric task to a collective optimization problem. Augmenting our theoretical insights, we design and implement scalable algorithms for computing collective counterfactuals, showcasing their effectiveness and advantages over existing recourse methods, particularly in aligning with collective objectives.

explanation, machine learning, natural language, (17 more...)

Country:

Europe > Germany (0.28)
North America > United States (0.28)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.92)

Industry: Banking & Finance (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Explanation & Argumentation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.87)

arXiv.org Machine Learning, Jun-17-2026

Another Look at Log-PCA for Probability Measures: A Dynamical Formulation and Statistical Convergence

Xu, Peng, Zhu, Changbo, Kim, Young-Heon, Chen, Xiaohui

Principal component analysis (PCA) is a major statistical analysis and machine learning tool for dimensional reduction and visualization of high-dimensional datasets [1]. Classical PCA in the Euclidean space is to find the eigenvectors associated with the top eigenvalues of the covariance matrix. Geometrically, PCA can be interpreted as finding the orthogonal directions that maximize the projected data variance to the linear subspace spanned by those directions. Recently, efforts for extending the Euclidean PCA to capture variations for a collection of probability measures have been made [2, 3, 4]. Since the Wasserstein space is an infinite-dimensional curved space, one challenge is to define a proper notion of principal mode of variations in the space of probability measures. In this paper, we take a variational and dynamical perspective of the Euclidean PCA that has robust generalization to the Wasserstein geometry. Specifically, given input data points $x_1,\dots,x_n$ in the Euclidean space $\mathbb{R}^m$, performing the standard PCA to find the first principal mode of variation $g_t = \bar{x}_n + tv$ passing through the mean $\bar{x}_n = n^{-1}\sum_{i=1}^{n} x_i$ can be reformulated as minimizing the residuals by projecting each data point in the direction $v$: $\hat{v}_1 = \operatorname{argmin}$

artificial intelligence, machine learning, principal mode, (16 more...)

2606.17196

Country: North America > United States (0.46)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Neural Information Processing Systems, Jun-16-2026, 16:15:48 GMT

Hessian-guided Perturbed Wasserstein Gradient Flows for Escaping Saddle Points

Wasserstein gradient flow (WGF) is a common method to perform optimization over the space of probability measures. While WGF is guaranteed to converge to a first-order stationary point, for nonconvex functionals the converged solution does not necessarily satisfy the second-order optimality condition; i.e., it could converge to a saddle point. In this work, we propose a new algorithm for probability measure optimization, perturbed Wasserstein gradient flow (PWGF), that achieves second-order optimality for general nonconvex objectives. PWGF enhances WGF by injecting noisy perturbations near saddle points via a Gaussian process-based scheme. By pushing the measure forward along a random vector field generated from a Gaussian process, PWGF helps the solution escape saddle points efficiently by perturbing the solution towards the smallest eigenvalue direction of the Wasserstein Hessian. We theoretically derive the computational complexity for PWGF to achieve a second-order stationary point. Furthermore, we prove that PWGF converges to a global optimum in polynomial time for strictly benign objectives.

artificial intelligence, dxdy, machine learning, (17 more...)

Country: Asia (0.28)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)