eff
A single algorithm for both restless and rested rotting bandits
Seznec, Julien, Ménard, Pierre, Lazaric, Alessandro, Valko, Michal
In many application domains (e.g., recommender systems, intelligent tutoring systems), the rewards associated to the actions tend to decrease over time. This decay is either caused by the actions executed in the past (e.g., a user may get bored when songs of the same genre are recommended over and over) or by an external factor (e.g., content becomes outdated). These two situations can be modeled as specific instances of the rested and restless bandit settings, where arms are rotting (i.e., their value decrease over time). These problems were thought to be significantly different, since Levine et al. (2017) showed that state-of-the-art algorithms for restless bandit perform poorly in the rested rotting setting. In this paper, we introduce a novel algorithm, Rotting Adaptive Window UCB (RAW-UCB), that achieves near-optimal regret in both rotting rested and restless bandit, without any prior knowledge of the setting (rested or restless) and the type of non-stationarity (e.g., piece-wise constant, bounded variation). This is in striking contrast with previous negative results showing that no algorithm can achieve similar results as soon as rewards are allowed to increase. We confirm our theoretical findings on a number of synthetic and dataset-based experiments.
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
- North America > United States (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Data Science > Data Mining > Big Data (0.46)
Momentum Further Constrains Sharpness at the Edge of Stochastic Stability
Andreyev, Arseniy, Ananthkumar, Advikar, Walden, Marc, Poggio, Tomaso, Beneventano, Pierfrancesco
Recent work suggests that (stochastic) gradient descent self-organizes near an instability boundary, shaping both optimization and the solutions found. Momentum and mini-batch gradients are widely used in practical deep learning optimization, but it remains unclear whether they operate in a comparable regime of instability. We demonstrate that SGD with momentum exhibits an Edge of Stochastic Stability (EoSS)-like regime with batch-size-dependent behavior that cannot be explained by a single momentum-adjusted stability threshold. Batch Sharpness (the expected directional mini-batch curvature) stabilizes in two distinct regimes: at small batch sizes it converges to a lower plateau $2(1-β)/η$, reflecting amplification of stochastic fluctuations by momentum and favoring flatter regions than vanilla SGD; at large batch sizes it converges to a higher plateau $2(1+β)/η$, where momentum recovers its classical stabilizing effect and favors sharper regions consistent with full-batch dynamics. We further show that this aligns with linear stability thresholds and discuss the implications for hyperparameter tuning and coupling.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- Asia > Middle East > Jordan (0.04)
- North America > United States > New Jersey > Mercer County > Princeton (0.04)
A 1/R Law for Kurtosis Contrast in Balanced Mixtures
Bi, Yuda, Xiao, Wenjun, Bai, Linhao, Calhoun, Vince D
Abstract--Kurtosis-based Independent Component Analysis (ICA) weakens in wide, balanced mixtures. We also show that purification--selecting m R sign-consistent sources--restores R-independent contrast Ω(1/m), with a simple data-driven heuristic. Synthetic experiments validate the predicted decay, the T crossover, and contrast recovery. Independent Component Analysis (ICA) recovers statistically independent latent sources from linear mixtures and is identifiable whenever at most one source is Gaussian [1]. Excess kurtosis--the standardized fourth cumulant--is a central contrast function [9], and kurtosis-type nonlinearities remain standard in FastICA.
- North America > United States > Georgia > Fulton County > Atlanta (0.05)
- North America > United States > District of Columbia > Washington (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (2 more...)
- Health & Medicine > Health Care Technology (0.70)
- Health & Medicine > Therapeutic Area > Neurology (0.70)
Scaling Laws for Precision in High-Dimensional Linear Regression
Zhang, Dechen, Tang, Xuan, Liang, Yingyu, Zou, Difan
Low-precision training is critical for optimizing the trade-off between model quality and training costs, necessitating the joint allocation of model size, dataset size, and numerical precision. While empirical scaling laws suggest that quantization impacts effective model and data capacities or acts as an additive error, the theoretical mechanisms governing these effects remain largely unexplored. In this work, we initiate a theoretical study of scaling laws for low-precision training within a high-dimensional sketched linear regression framework. By analyzing multiplicative (signal-dependent) and additive (signal-independent) quantization, we identify a critical dichotomy in their scaling behaviors. Our analysis reveals that while both schemes introduce an additive error and degrade the effective data size, they exhibit distinct effects on effective model size: multiplicative quantization maintains the full-precision model size, whereas additive quantization reduces the effective model size. Numerical experiments validate our theoretical findings. By rigorously characterizing the complex interplay among model scale, dataset size, and quantization error, our work provides a principled theoretical basis for optimizing training protocols under practical hardware constraints.
Thermodynamic Isomorphism of Transformers: A Lagrangian Approach to Attention Dynamics
We propose an effective field-theoretic framework for analyzing Transformer attention through a thermodynamic lens. By constructing a Lagrangian on the information manifold equipped with the Fisher metric, we show that, within the Shannon--Boltzmann entropy framework, the Softmax function arises as a stationary solution minimizing a Helmholtz free energy functional. This establishes a formal correspondence between scaled dot-product attention and canonical ensemble statistics. Extending this mapping to macroscopic observables, we define an effective specific heat associated with fluctuations of the attention energy landscape. In controlled experiments on the modular addition task ($p = 19$--$113$), we observe a robust peak in this fluctuation measure that consistently precedes the onset of generalization. While no asymptotic power-law divergence is detected in this finite-depth regime, the reproducible enhancement of energy variance suggests a critical-like crossover accompanying representational reorganization. Our framework provides a unified statistical-mechanical perspective on attention scaling, training dynamics, and positional encoding, interpreting the phenomena as emergent properties of an effective thermodynamic system rather than isolated heuristics. Although the present results indicate finite-size crossover behavior rather than a strict phase transition, they motivate further investigation into scaling limits of deep architectures through fluctuation-based observables.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Jordan (0.04)
The Cost of Learning under Multiple Change Points
Gafni, Tomer, Iyengar, Garud, Zeevi, Assaf
We consider an online learning problem in environments with multiple change points. In contrast to the single change point problem that is widely studied using classical "high confidence" detection schemes, the multiple change point environment presents new learning-theoretic and algorithmic challenges. Specifically, we show that classical methods may exhibit catastrophic failure (high regret) due to a phenomenon we refer to as endogenous confounding. To overcome this, we propose a new class of learning algorithms dubbed Anytime Tracking CUSUM (ATC). These are horizon-free online algorithms that implement a selective detection principle, balancing the need to ignore "small" (hard-to-detect) shifts, while reacting "quickly" to significant ones. We prove that the performance of a properly tuned ATC algorithm is nearly minimax-optimal; its regret is guaranteed to closely match a novel information-theoretic lower bound on the achievable performance of any learning algorithm in the multiple change point problem. Experiments on synthetic as well as real-world data validate the aforementioned theoretical findings.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > United States > California > Santa Clara County > Stanford (0.04)
- Europe > Spain > Galicia > Madrid (0.04)
- Transportation > Passenger (0.46)
- Information Technology > Services (0.45)
- North America > United States > California > Los Angeles County > Long Beach (0.14)
- North America > Canada > British Columbia > Vancouver (0.05)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.05)
- (8 more...)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)