AITopics | supremum

Collaborating Authors

supremum

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Uniform Scaling Limits in AdamW-Trained Transformers

Gibson, William, Reisinger, Christoph

arXiv.org Machine LearningMay-13-2026

We study the large-depth limit of transformers trained with AdamW, by modelling the hidden-state dynamics as an interacting particle system (IPS) coupled through the attention mechanism. Under appropriate scaling of the attention heads, we prove that the joint dynamics of the hidden states and backpropagated variables converge in $L^2$, uniformly over the initial condition, to the solution of a forward--backward system of ODEs at rate $\mathcal O(L^{-1}+L^{-1/3}H^{-1/2})$. Here, $L$ and $H$ denote the depth and number of heads of the transformer, respectively. The limiting system of ODEs can be identified with a McKean--Vlasov ODE (MVODE) when the attention heads do not incorporate causal masking. By using the flow maps associated with this MVODE and applying concentration of measure techniques, we obtain bounds on the difference between the discrete and continuous models that are uniform over compact sets of initial conditions. As this is achieved without resorting to a covering argument, the constants in our bounds are independent of the number of tokens. Furthermore, under a suitable adaptation to AdamW, the bounds become independent of the token embedding dimension.

large language model, machine learning, natural language, (19 more...)

arXiv.org Machine Learning

2605.11059

Country:

Europe > United Kingdom (0.28)
North America > United States > New York (0.27)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.92)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)

Add feedback

Beyond the Independence Assumption: Finite-Sample Guarantees for Deep Q-Learning under $τ$-Mixing

Halgryn, Leon, Langer, Sophie, Meylahn, Janusz M., Hahn, E. Moritz

arXiv.org Machine LearningMay-8-2026

Finite-sample analyses of deep Q-learning typically treat replayed data as independent, even though it is sampled from temporally dependent state-action trajectories. We study the Deep Q-networks (DQN) algorithm under explicit dependence by modelling the minibatches used for updating the network as $τ$-mixing. We show that this assumption holds under certain dependence conditions on the underlying trajectories and the mechanism used to sample minibatches. Building on this observation, we extend statistical analyses of DQN with fully connected ReLU architectures to dependent data. We formulate each update as a nonparametric regression problem with $τ$-mixing observations and derive finite-sample risk bounds under this dependence structure. Our results show that temporal dependence leads to a degradation in the statistical rate by inducing an additional dimensionality penalty in the rate exponent, reflecting the reduced effective sample size of $τ$-mixing data. Moreover, we derive the sample complexity of DQN under $tau$-mixing from these risk bounds. Finally, we empirically demonstrate on standard Gymnasium environments that the independence assumption is systematically violated and that replay sampling yields approximately exponentially decaying correlations, supporting our theoretical framework.

artificial intelligence, machine learning, reinforcement learning, (19 more...)

arXiv.org Machine Learning

2605.06373

Genre: Research Report > New Finding (0.86)

Industry: Leisure & Entertainment > Games (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

Add feedback

The Many Faces of Adversarial Risk

Neural Information Processing SystemsApr-25-2026, 22:30:57 GMT

Adversarial risk quantifies the performance of classifiers on adversarially perturbed data. Numerous definitions of adversarial risk--not all mathematically rigorous and differing subtly in the details--have appeared in the literature. In this paper, we revisit these definitions, make them rigorous, and critically examine their similarities and differences. Our technical tools derive from optimal transport, robust statistics, functional analysis, and game theory. Our contributions include the following: generalizing Strassen's theorem to the unbalanced optimal transport setting with applications to adversarial classification with unequal priors; showing an equivalence between adversarial robustness and robust hypothesis testing with -Wasserstein uncertainty sets; proving the existence of a pure Nash equilibrium in the two-player game between the adversary and the algorithm; and characterizing adversarial risk by the minimum Bayes error between a pair of distributions belonging to the -Wasserstein uncertainty sets. Our results generalize and deepen recently discovered connections between optimal transport and adversarial robustness and reveal new connections to Choquet capacities and game theory.

adversarial risk, artificial intelligence, machine learning, (21 more...)

Neural Information Processing Systems

Country: North America > United States > Wisconsin (0.28)

Industry:

Leisure & Entertainment > Games (0.54)
Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Game Theory (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

2417dc8af8570f274e6775d4d60496da-Supplemental.pdf

Neural Information Processing SystemsApr-25-2026, 03:30:16 GMT

artificial intelligence, machine learning, nullnull, (17 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Add feedback

Multi-step learning and underlying structure in statistical models

Maia Fraser

Neural Information Processing SystemsMar-23-2026, 06:54:58 GMT

In multi-step learning, where a final learning task is accomplished via a sequence of intermediate learning tasks, the intuition is that successive steps or levels transform the initial data into representations more and more "suited" to the final learning task. A related principle arises in transfer-learning where Baxter (2000) proposed a theoretical framework to study how learning multiple tasks transforms the inductive bias of a learner. The most widespread multi-step learning approach is semisupervised learning with two steps: unsupervised, then supervised. Several authors (Castelli-Cover, 1996; Balcan-Blum, 2005; Niyogi, 2008; Ben-David et al, 2008; Urner et al, 2011) have analyzed SSL, with Balcan-Blum (2005) proposing a version of the PAC learning framework augmented by a "compatibility function" to link concept class and unlabeled data distribution. We propose to analyze SSL and other multi-step learning approaches, much in the spirit of Baxter's framework, by defining a learning problem generatively as a joint statistical model on X Y.

artificial intelligence, learning, machine learning, (18 more...)

Neural Information Processing Systems

Country: North America > United States (0.28)

Industry: Education (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Computational Learning Theory (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Add feedback

On the Properties of Kullback-Leibler Divergence Between Multivariate Gaussian Distributions

Neural Information Processing SystemsFeb-16-2026, 16:47:52 GMT

Kullback-Leibler (KL) divergence is one of the most important measures to calculate the difference between probability distributions. In this paper, we theoretically study several properties of KL divergence between multivariate Gaussian distributions.

artificial intelligence, data mining, machine learning, (16 more...)

Neural Information Processing Systems

Country:

Asia > China > Hunan Province > Changsha (0.04)
North America > United States > California (0.04)
North America > Canada > British Columbia > Vancouver (0.04)
(2 more...)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (0.48)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.46)

Add feedback

On Slicing Optimality for Mutual Information Ammar Fayad

Neural Information Processing SystemsFeb-14-2026, 13:29:19 GMT

P and Q, respectively, is tight in P (X Y). Hero, 2004; Ghourchian et al., 2017), we present the outline of our argument into three steps: K P (X) is tight iff the closure of K is sequentially compact in P ( X) with respect to the weak convergence. Remark 1. W e could proceed differently by imposing stronger assumptions using the following W e briefly discuss the outline of the proof for the sake of completeness. (Loeve, 2017). The argument here depends on two important facts: 1.

artificial intelligence, experiment, machine learning, (14 more...)

Neural Information Processing Systems

Country: North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

2417dc8af8570f274e6775d4d60496da-Supplemental.pdf

Neural Information Processing SystemsFeb-7-2026, 21:33:59 GMT

inequality, learning theory, nullnull, (15 more...)

Neural Information Processing Systems

Country:

North America > United States > Massachusetts > Suffolk County > Boston (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
Asia > Middle East > Jordan (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Add feedback

A Unified Kantorovich Duality for Multimarginal Optimal Transport

Cheryala, Yehya, Alaya, Mokhtar Z., Bouzebda, Salim

arXiv.org Machine LearningJan-27-2026

Multimarginal optimal transport (MOT) has gained increasing attention in recent years, notably due to its relevance in machine learning and statistics, where one seeks to jointly compare and align multiple probability distributions. This paper presents a unified and complete Kantorovich duality theory for MOT problem on general Polish product spaces with bounded continuous cost function. For marginal compact spaces, the duality identity is derived through a convex-analytic reformulation, that identifies the dual problem as a Fenchel-Rockafellar conjugate. We obtain dual attainment and show that optimal potentials may always be chosen in the class of $c$-conjugate families, thereby extending classical two-marginal conjugacy principle into a genuinely multimarginal setting. In non-compact setting, where direct compactness arguments are unavailable, we recover duality via a truncation-tightness procedure based on weak compactness of multimarginal transference plans and boundedness of the cost. We prove that the dual value is preserved under restriction to compact subsets and that admissible dual families can be regularized into uniformly bounded $c$-conjugate potentials. The argument relies on a refined use of $c$-splitting sets and their equivalence with multimarginal $c$-cyclical monotonicity. We then obtain dual attainment and exact primal-dual equality for MOT on arbitrary Polish spaces, together with a canonical representation of optimal dual potentials by $c$-conjugacy. These results provide a structural foundation for further developments in probabilistic and statistical analysis of MOT, including stability, differentiability, and asymptotic theory under marginal perturbations.

artificial intelligence, dual problem, machine learning, (19 more...)

arXiv.org Machine Learning

2601.17171

Country: North America > United States (0.28)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback