AITopics | log 2

We study the distribution of regret in stochastic multi-armed bandits and episodic reinforcement learning through a unified framework. We formalize a distributional regret bound as a probabilistic guarantee that holds uniformly over all confidence levels $δ\in (0,1]$, thereby characterizing the regret distribution across the full range of $δ$. We present a simple UCBVI-style algorithm with exploration bonus $\min\{c_{1,k}/N, c_{2,k}/\sqrt{N}\}$, where $N$ denotes the visit count and $(c_{1,k},c_{2,k})$ are user-specified parameters. For arbitrary parameter sequences, we derive general gap-independent and gap-dependent distributional regret bounds, yielding a principled characterization of how the parameters control the trade-off between expected performance, tail risk, and instance-dependent behavior. In particular, our bounds achieve optimal trade-offs between expected and distributional regret in both minimax and instance-dependent regimes. As a special case, for multi-armed bandits with $A$ arms and horizon $T$, we obtain a distributional regret bound of order $\mathcal{O}(\sqrt{AT}\log(1/δ))$, confirming the conjecture of Lattimore & Szepesvári (2020, Section 17.1) for the first time.
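The exploration bonus named in the abstract can be sketched in a few lines. This is only an illustration of the formula $\min\{c_{1,k}/N, c_{2,k}/\sqrt{N}\}$; the function name `exploration_bonus`, the argument names, and the treatment of unvisited pairs ($N = 0$) are assumptions made here, not details taken from the paper.

```python
import math


def exploration_bonus(visit_count: int, c1: float, c2: float) -> float:
    """Return min{c1 / N, c2 / sqrt(N)} for a pair visited N times.

    c1 and c2 stand in for the user-specified parameters (c_{1,k}, c_{2,k}).
    """
    if visit_count <= 0:
        # Assumption: an unvisited pair gets an unbounded bonus.
        return math.inf
    return min(c1 / visit_count, c2 / math.sqrt(visit_count))


# With c1 = 4 and c2 = 1 the two terms cross at N = (c1 / c2)^2 = 16:
# below that the c2 / sqrt(N) term is the smaller one, above it c1 / N is.
small_n_bonus = exploration_bonus(4, 4.0, 1.0)    # c2 / sqrt(4) = 0.5
large_n_bonus = exploration_bonus(100, 4.0, 1.0)  # c1 / 100 = 0.04
```

The crossover at $N = (c_1/c_2)^2$ suggests how the two parameters could trade off early and late exploration, which is the kind of control the abstract attributes to them.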

data mining, machine learning, reinforcement learning, (22 more...)

arXiv.org Machine Learning

2605.05102

Genre: Research Report (0.81)

Technology:

Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.84)


Adapt or Forget: Provable Tradeoffs Between Adam and SGD in Nonstationary Optimization

Sahu, Sharan, Sarkar, Abir, Hogan, Cameron J., Wells, Martin T.

arXiv.org Machine Learning | May-7-2026

We provide a theoretical analysis of Adam under non-stationary stochastic objectives, separating two regimes: Euclidean tracking under adaptive strong monotonicity of the Adam-preconditioned mean-gradient operator, and high-probability projected stationarity guarantees under general $L$-smooth objectives. In the tracking regime, we derive finite-time expected and high-probability bounds that decompose sharply into four components: initialization, objective drift, a first-moment tracking error governed by $β_1$, and a preconditioner perturbation governed by $β_2$. We characterize the burn-in time to reach Adam's irreducible tracking floor under constant and step-decay schedules. We also prove a high-probability bound on the average projected stationarity gap for Adam under distribution shift. Across both analyses, our bounds reveal a noise-drift tradeoff: in noise-dominated regimes, first-moment averaging and adaptive preconditioning can improve the high-probability error, whereas in drift-dominated regimes, stale first-moment information and preconditioner perturbations can compound the cost of nonstationarity, allowing vanilla SGD to achieve a smaller tracking floor. Our explicit $(β_1,β_2,ε)$-dependent bounds delineate when adaptive step-sizing is beneficial versus harmful, and provide a theoretical mechanism for Adam's empirical instability and stabilization under distribution shift.
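For readers who want the roles of $β_1$, $β_2$ and $ε$ made concrete, the following is a minimal scalar sketch of the textbook Adam update with bias correction, next to a plain SGD step. It assumes the standard formulation; the exact variant analysed in the paper may differ, and the function names `sgd_step` and `adam_step` are chosen here for illustration.

```python
import math


def sgd_step(x: float, grad: float, lr: float) -> float:
    """One vanilla SGD step: move against the current gradient only."""
    return x - lr * grad


def adam_step(x, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam step (textbook form, with bias correction).

    m is the first-moment average (governed by beta1), v is the
    second-moment average used as the preconditioner (governed by beta2),
    and t is the 1-based step counter.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    x = x - lr * m_hat / (math.sqrt(v_hat) + eps)
    return x, m, v


# First step from x = 0 with gradient 2.0 and learning rate 0.1.
x_sgd = sgd_step(0.0, 2.0, 0.1)
x_adam, m1, v1 = adam_step(0.0, 2.0, m=0.0, v=0.0, t=1, lr=0.1)
```

Because `m` and `v` average over past gradients, they carry information from earlier objectives when the objective drifts, which is where the stale first-moment and preconditioner terms described in the abstract would enter.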

artificial intelligence, log 2, machine learning, (15 more...)

arXiv.org Machine Learning

2605.04269

Country: North America > United States > New York (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)


b069b3415151fa7217e870017374de7c-Paper.pdf

Neural Information Processing Systems | Apr-30-2026, 22:25:55 GMT

artificial intelligence, machine learning, metric space, (19 more...)

Neural Information Processing Systems

Country: North America > United States > Ohio (0.14)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)


eb1bad7a84ef68a64f1afd6577725d45-Paper-Conference.pdf

Neural Information Processing Systems | Apr-30-2026, 04:41:45 GMT

artificial intelligence, deep learning, machine learning, (19 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Data Science (0.92)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)


a75db7d2ee1e4bee8fb819979b0a6cad-Paper-Conference.pdf

Neural Information Processing Systems | Apr-29-2026, 07:38:39 GMT

artificial intelligence, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country: North America > United States (0.93)

Genre: Research Report (0.69)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Regional Government > North America Government > United States Government (0.67)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Natural Language (0.68)


Sample Complexity of Forecast Aggregation

Neural Information Processing Systems | Apr-28-2026, 14:13:10 GMT

We consider a Bayesian forecast aggregation model where $n$ experts, after observing private signals about an unknown binary event, report their posterior beliefs about the event to a principal, who then aggregates the reports into a single prediction for the event. The signals of the experts and the outcome of the event follow a joint distribution that is unknown to the principal, but the principal has access to i.i.d. "samples" from the distribution, where each sample is a tuple of the experts' reports (not signals) and the realization of the event. Using these samples, the principal aims to find an $ε$-approximately optimal aggregator, where optimality is measured in terms of the expected squared distance between the aggregated prediction and the realization of the event. We show that the sample complexity of this problem is at least $Ω(m^{n-2}/ε)$ for arbitrary discrete distributions, where $m$ is the size of each expert's signal space. This sample complexity grows exponentially in the number of experts $n$. However, if the experts' signals are independent conditioned on the realization of the event, then the sample complexity is significantly reduced, to $O(1/ε^2)$, which does not depend on $n$. Our results can be generalized to non-binary events. The proof uses a reduction from the distribution learning problem and shows that forecast aggregation is almost as difficult as distribution learning.
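The loss in this setup can be made concrete with a small sketch: each sample pairs the experts' reports with the realized outcome, and an aggregator is scored by its mean squared distance to the outcome. The simple averaging aggregator and the three samples below are hypothetical and chosen only for illustration; they are not the paper's aggregator or data.

```python
def average_aggregator(reports):
    """Illustrative baseline: the unweighted mean of the experts' reports."""
    return sum(reports) / len(reports)


def empirical_squared_loss(aggregator, samples):
    """Mean squared distance between aggregated prediction and outcome.

    Each sample is (reports, outcome): a tuple of reported posteriors
    and the realization of the binary event (0 or 1).
    """
    total = 0.0
    for reports, outcome in samples:
        total += (aggregator(reports) - outcome) ** 2
    return total / len(samples)


# Hypothetical samples for n = 2 experts.
samples = [
    ((0.8, 0.6), 1),
    ((0.3, 0.1), 0),
    ((0.5, 0.7), 1),
]
loss = empirical_squared_loss(average_aggregator, samples)
```

An $ε$-approximately optimal aggregator is one whose expected loss is within $ε$ of the best achievable; the empirical loss above is the natural sample-based estimate of that expectation.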

artificial intelligence, bayesian inference, machine learning, (16 more...)

Neural Information Processing Systems

Country:

North America > United States (1.00)
Europe (0.67)

Industry: Education (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)


543924fdf260ba990f2ef84f940f3db2-Paper-Conference.pdf

Neural Information Processing Systems | Apr-27-2026, 13:45:47 GMT

artificial intelligence, data mining, machine learning, (18 more...)

Neural Information Processing Systems

Country: North America > United States (0.28)

Industry: Education > Educational Setting (1.00)

Technology:

Information Technology > Data Science > Data Mining (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Supervised Learning (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.47)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.45)


Parameter Tuning

Neural Information Processing Systems | Apr-27-2026, 09:50:56 GMT

If observations from the joint distribution of $(A, Y, Z, W)$ are available in both stages, we can tune the regularization parameters $λ_1, λ_2$ using the approach proposed in Singh et al. [30], Xu et al. [35]. Let the complete data of stage 1 and stage 2 be denoted as $(a_i, y_i, z_i, w_i)$ and $(\tilde{a}_i, \tilde{y}_i, \tilde{z}_i, \tilde{w}_i)$. Then, we can use the data not used in each stage to evaluate the out-of-sample performance of the other stage. A(2), V̂(T), u(T) are the parameters learned by Algorithm 1. In this appendix, we prove propositions given in the main text. In the following, we assume that the spaces $U, A, Z, W$ are separable and completely metrizable topological spaces equipped with Borel σ-algebras. In this section, we use the notation $P_{A|Z=z}$ to express the distribution of a random variable $A$ given another variable $Z = z$.

artificial intelligence, log 2, machine learning, (17 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)
