Energy
Dropout Universality: Scaling Laws and Optimal Scheduling at the Edge-of-Chaos
We develop a mean-field theory of dropout as a perturbation of critical signal propagation at the edge of chaos. Dropout shifts the perfect-alignment fixed point, making the depth scale for information propagation finite even at critical initialization. We derive critical and crossover scaling laws for correlation decay and establish that smooth activations and kinked, ReLU-like activations constitute distinct universality classes, with different critical exponents and a universal two-parameter scaling collapse in detuning and dropout strength. The distinction traces to the analytic structure of the correlation map: smooth activations admit a Taylor expansion near perfect alignment, while kinked activations develop a branch point with universal non-analyticity. As a corollary, the framework yields saturated dropout profiles under fixed budget; a rank-flow tie-breaker then selects front-loaded schedules, substantially reducing held-out test loss at no extra computational cost, with accuracy gains as a consistent secondary effect. We test the predictions in MLPs and Vision Transformers and discuss CNN/ResNet extensions.
Guiding Multi-Objective Genetic Programming with Description Length Improves Symbolic Regression Solutions
Kronberger, Gabriel, de Franca, Fabricio Olivetti, Bartlett, Deaglan J., Desmond, Harry, Ferreira, Pedro G.
Symbolic regression with genetic programming (GPSR) may suffer from overfitting and structural bloat, especially when noise is present. In this paper we evaluate description length (DL) and fractional Bayes factor (FBF) criteria as principled, data-efficient alternatives to heuristics for selecting compact expressions that generalise well. We implement DL using a Fisher-information-based parameter encoding and compare it to AIC and BIC across multiple datasets, including noisy synthetic benchmarks and real-world regression problems. We study three search/selection strategies: (i) multi-objective search for accuracy and program length followed by DL/FBF selection; (ii) multi-objective search using DL directly as an objective; and (iii) single-objective optimisation with DL/FBF as the fitness. Across datasets we find that DL/FBF post-selection improves test performance compared to the AIC/BIC baselines, and that BIC combined with the same function-complexity penalty used in DL/FBF produces similar results. In contrast, using DL/FBF directly as a fitness function in single-objective GPSR frequently induces premature convergence to overly simple models. We conclude with practical guidance for using DL/FBF as robust model-selection tools in genetic programming workflows.
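As a rough illustration of the AIC/BIC baselines mentioned above, the following sketch scores two candidate expressions by their residuals and picks the one with the lowest criterion value. It does not implement the paper's DL or FBF criteria; the candidate names, parameter counts, and residuals are hypothetical, and a Gaussian noise model with maximum-likelihood variance is assumed.

```python
import math


def gaussian_log_likelihood(residuals):
    """Maximised Gaussian log-likelihood with the variance at its ML estimate."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)


def aic(log_lik, k):
    """Akaike information criterion for k fitted parameters."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Bayesian information criterion for k parameters and n data points."""
    return k * math.log(n) - 2 * log_lik


def select(candidates, criterion):
    """Return the name of the candidate with the lowest criterion value."""
    scored = []
    for name, k, residuals in candidates:
        ll = gaussian_log_likelihood(residuals)
        n = len(residuals)
        score = aic(ll, k) if criterion == "aic" else bic(ll, k, n)
        scored.append((score, name))
    return min(scored)[1]


# Hypothetical candidates: (name, number of fitted parameters, residuals).
candidates = [
    ("simple", 2, [0.30, -0.25, 0.28, -0.31, 0.27, -0.29, 0.30, -0.26]),
    ("complex", 6, [0.27, -0.24, 0.26, -0.28, 0.25, -0.27, 0.28, -0.24]),
]

best_by_aic = select(candidates, "aic")
best_by_bic = select(candidates, "bic")
```

In this toy setup the larger expression fits only slightly better, so both criteria prefer the smaller one; with more data or a larger fit improvement the ranking can change.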
Topological Kalman Filtering on Cell Complexes
Liu, Chengen, Money, Rohan, Gao, Ting, Sabbaqi, Mohammad, Beferull-Lozano, Baltasar, Isufi, Elvin
Inferring latent dynamics from multivariate time-series defined over topological cell complexes is crucial for capturing the complex, higher-order interactions inherent in real-world systems such as in water, sensor, and transportation networks. However, reconstructing these latent states is challenging because the signals are coupled across higher-order topologies, while high dimensionality, nonlinear observations, and unknown structures increase the difficulty. To address this, we propose a topology-aware state space framework derived from stochastic partial differential equations on cell complexes. State evolution follows heat-like topological diffusion, with perturbations propagating along boundary operators. Under partial observability, we model observations using a cell complex convolution of latent states coupled with a nonlinear mapping. We perform recursive state estimation via an Extended Kalman Filter, simultaneously learning model parameters and uncertainties through an online Expectation-Maximization algorithm. Finally, for scenarios where only lower-order topological structure is known, e.g., nodes and edges, as in critical infrastructure networks, we introduce a heuristic cell identification algorithm to explicitly infer the second-order cell structures. Validations on synthetic and real datasets from water, sensor and transportation networks demonstrate that our approach yields reliable estimates under partial observability and successfully recovers the underlying topological structures.
Catching a Moving Subspace: Low-Rank Bandits Beyond Stationarity
Khosravi, Hamed, Huo, Xiaoming
Many bandit deployments (recommendation, clinical dosing, ad targeting) share two facts prior work handles only in isolation: rewards live on a low-dimensional latent subspace, and that subspace drifts. Stationary low-rank bandits exploit rank but break under subspace change; non-stationary linear bandits adapt to drift but pay ambient rate $\widetilde{O}(d\sqrt{T})$. We study piecewise-stationary low-rank linear contextual bandits with scalar feedback: $\theta_t = B_k^\star w_t$ with rank-$r$ factor $B_k^\star\in\mathbb{R}^{d\times r}$ constant within each of $K$ unknown segments and able to shift at boundaries. Our results are tight along three axes. (i) Identification boundary. With single-play scalar rewards, the moving subspace is recoverable through quadratic functionals of rewards iff three probe-side conditions hold: known noise variance, bounded state-noise coupling, and full-dimensional probe support. Each is necessary in the unrestricted-second-moment problem, and jointly they are sufficient, characterizing the boundary of the solvable region. (ii) Algorithm and dynamic regret. SPSC interleaves isotropic probes with windowed projected ridge-UCB exploitation inside the learned $r$-dimensional subspace; a CUSUM-style variant discovers segment boundaries online. The costed dynamic regret is $\widetilde{O}(r\sqrt{T})+\widetilde{O}(T^{2/3})+O(W\,V_{\mathrm{in}})$, replacing the ambient $d\sqrt{T}$ rate with the intrinsic rank. (iii) Empirics. On eleven benchmarks spanning synthetic, UCI/MovieLens, semi-synthetic clinical, and ZOZOTOWN production-log data, SPSC outperforms non-stationary and low-rank baselines whenever $d-r\gtrsim T^{1/6}$, matching the analytical crossover. To our knowledge, this is the first work to characterize the identification boundary and attain the intrinsic-rank dynamic-regret rate in this setting.
Conditioning Gaussian Processes on Almost Anything
Moss, Henry, Astfalck, Lachlan, Cowperthwaite, Thomas, Doumont, Colin, Willis, Sam, Hennig, Philipp, Nemeth, Christopher, Zammit-Mangion, Andrew
Gaussian processes (GPs) offer a principled probabilistic model over functions, but exact inference is restricted to the linear-Gaussian regime. We establish an explicit equivalence between GPs and a class of linear diffusion models, recasting predictive sampling as an ODE with closed-form Gaussian dynamics and a likelihood-dependent guidance term that admits a simple Monte Carlo approximation. In the linear-Gaussian setting, we recover standard GP conditioning exactly; beyond conjugacy, the same machinery handles any conditioning statement admitting point-wise likelihood evaluation -- including non-linear physics, and, for the first time, natural language via large language models. Whitening isolates the irreducible non-Gaussian dynamics, minimising Wasserstein-2 transport cost and eliminating numerical stiffness. The result is a general-purpose GP inference scheme requiring no bespoke derivations. Together, these results provide a general mechanism for incorporating the full richness of real-world knowledge as conditioning information, opening a new frontier for the probabilistic modelling of real-world problems.
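For reference, the standard linear-Gaussian conditioning that the abstract says is recovered exactly can be sketched in a few lines. This is textbook GP regression, not the diffusion-based scheme of the paper; the RBF kernel, its hyperparameters, and the restriction to two training points (so that a hand-written 2x2 solve suffices) are assumptions made to keep the example dependency-free.

```python
import math


def rbf(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel (an assumed choice for illustration)."""
    return variance * math.exp(-0.5 * ((x1 - x2) / lengthscale) ** 2)


def solve_2x2(a, b):
    """Solve a @ x = b for a 2x2 matrix a using Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [
        (b[0] * a[1][1] - a[0][1] * b[1]) / det,
        (a[0][0] * b[1] - b[0] * a[1][0]) / det,
    ]


def gp_posterior(x_train, y_train, x_star, noise_var=1e-2):
    """Posterior mean and variance at x_star given two noisy observations."""
    # Gram matrix of the training inputs plus observation noise on the diagonal.
    k = [
        [rbf(xi, xj) + (noise_var if i == j else 0.0)
         for j, xj in enumerate(x_train)]
        for i, xi in enumerate(x_train)
    ]
    k_star = [rbf(x_star, xi) for xi in x_train]
    alpha = solve_2x2(k, y_train)   # (K + noise * I)^{-1} y
    v = solve_2x2(k, k_star)        # (K + noise * I)^{-1} k_star
    mean = sum(ks * a for ks, a in zip(k_star, alpha))
    var = rbf(x_star, x_star) - sum(ks * vi for ks, vi in zip(k_star, v))
    return mean, var


# Near an observation the posterior follows the data; far away it reverts
# to the zero-mean, unit-variance prior.
mean_near, var_near = gp_posterior([0.0, 1.0], [1.0, -1.0], 0.0, noise_var=1e-6)
mean_far, var_far = gp_posterior([0.0, 1.0], [1.0, -1.0], 10.0, noise_var=1e-6)
```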
$L^2$ over Wasserstein: Statistical Analysis for Optimal Transport
Passeggeri, Riccardo, Shenoy, Rohan M., Ye, Pengcheng
Optimal transport provides an inherently geometric and highly structured framework for studying spaces of probability measures, supplying a rich theoretical toolkit for contemporary statistics, machine learning, and generative modelling. In applications, however, the measures of interest are almost never known precisely, calling for a theory of optimal transport that accounts for statistical uncertainty. We construct such a framework, lifting the classical theory to the setting of random probability measures. We introduce the $L^2$ over Wasserstein space, establishing that it inherits the formal Riemannian structure of the Wasserstein space by characterising distances and geodesic geometry. The structure induces random flows with Wasserstein gradient flow sample paths, making it the natural extension of the Wasserstein space that allows for random gradient flow dynamics. We assemble statistical convergence results of the optimal transport machinery using the empirical measure within the $L^2$ over Wasserstein framework. Moreover, in the setting of Bayesian non-parametrics, we refine Schwartz's consistency theorem to the Wasserstein topology and deduce posterior convergence of the same machinery in the $L^2$ over Wasserstein space. We demonstrate that the growing theory of random token sampling for transformer models using self-attention flow paths can be embedded into our framework. The results provide a unified treatment of random optimal transport and its consequences for principled inference and generative modelling under the statistical uncertainty of random sampling.
Neural Negative Binomial Regression for Weekly Seismicity Forecasting: Per-Cell Dispersion Estimation and Tail Risk Assessment
Earthquake forecasting is a critical task for natural risk management, infrastructure resilience planning, and emergency response operations. For Central Asia, and the Tian Shan mountain system in particular, this problem carries heightened importance due to high tectonic activity, complex geodynamics, and pronounced spatiotemporal heterogeneity of seismic processes. In the applied setting, the goal is not a deterministic forecast of individual events, but a macroscopic forecast of seismicity intensity: estimating the expected number of earthquakes with magnitude M ≥ 3.0 on a spatial grid at a weekly horizon. Historically, count data forecasting in fixed spatiotemporal cells has been formulated within the Poisson framework. However, its key assumption--equality of the conditional mean and conditional variance--is systematically violated in real seismological data. Earthquakes exhibit pronounced clustering associated with swarm activity, foreshock-aftershock sequences, and episodes of anomalous activity, resulting in overdispersion in which the variance substantially exceeds the mean. Under these conditions, uncritical application of the Poisson distribution leads to biased uncertainty estimates and, consequently, to underestimation of the risk of extreme scenarios. Despite the widespread adoption of machine learning methods in seismological problems, a substantial portion of existing work remains methodologically vulnerable. On one hand, several approaches apply continuous regression loss functions and metrics (e.g., MSE), ignoring the
Accurate Evaluation of Quickest Changepoint Detectors via Non-parametric Survival Analysis
Miyagawa, Taiki, Ebihara, Akinori F.
We propose non-parametric estimators for the average run length (ARL) and average detection delay (ADD) in quickest changepoint detection (QCD) under finite and irregular sequence lengths. Although ARL and ADD are widely used as optimality criteria in theoretical and simulation studies, their application to real-world datasets is hindered by limited and irregular sequence lengths. To address this issue, our estimators, termed KM-ARL and KM-ADD, draw an analogy between QCD and survival analysis to model detection probabilities under sequence truncation. We derive estimation bias bounds and prove that the estimators are asymptotically unbiased unless extrapolation is required. Experiments on simulated and real-world datasets demonstrate their practical utility, enhancing robustness against limited and irregular sequence lengths, improving interpretability, and facilitating empirical, intuitive model selection. Our Python code is provided at https://github.com/TaikiMiyagawa/Kaplan-Meier-Average-Run-Length, offering ready-to-use implementations for practitioners.
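The survival-analysis analogy can be illustrated with a generic Kaplan-Meier estimate, treating "alarm raised" as the event and "sequence ended before any alarm" as right-censoring. This is only a sketch of the underlying idea under those assumptions, not the authors' KM-ARL or KM-ADD (see their repository for those); the function names, the restricted-mean summary, and the toy data are hypothetical.

```python
def kaplan_meier(times, observed):
    """Kaplan-Meier survival curve as a list of (time, S(time)) pairs.

    times[i]    : alarm time, or the sequence length if no alarm was raised
    observed[i] : True if an alarm was raised, False if the run was censored
    """
    event_times = sorted({t for t, o in zip(times, observed) if o})
    survival, curve = 1.0, []
    for t in event_times:
        # Runs still without an alarm and not yet truncated just before t.
        at_risk = sum(1 for u in times if u >= t)
        events = sum(1 for u, o in zip(times, observed) if o and u == t)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


def restricted_mean(curve, horizon):
    """Area under the step survival curve on [0, horizon]."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in curve:
        if t >= horizon:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return area + prev_s * (horizon - prev_t)


# Toy data: five runs, two of which ended (at t=3 and t=8) without an alarm.
times = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]
curve = kaplan_meier(times, observed)
mean_run_length = restricted_mean(curve, horizon=8)
```

Averaging only the three observed alarm times would give about 3.3, whereas the censoring-aware restricted mean up to the horizon is larger, which is the kind of truncation bias the analogy is meant to address.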
The Thermodynamic Costs of Simple Linear Regression
D'Ambrosia, Samuel H., Daniels, Sultan M., DeWeese, Michael R., Sahai, Anant
The construction of models from data is a significant contributor to the energetic costs of computation. Because of this, understanding how foundational thermodynamic bounds apply to modeling algorithms will be increasingly important. Here, we study the thermodynamic costs of a basic and fundamental modeling algorithm: simple linear regression. Following Landauer, we approximate the thermodynamic lower bound on irreversibly performing both exact linear regression and linear regression via stochastic gradient descent as implemented on floating-point numbers. From this, we derive energy-cost-aware scaling laws for the optimal dataset size for training a linear regression model given a generalization-error-dependent demand for inference. Additionally, we discuss a method to lower bound the entropy production from the mismatch cost for algorithms with continuous input variables.
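For orientation, Landauer's bound states that erasing one bit at temperature $T$ dissipates at least $k_B T \ln 2$ of energy. The sketch below evaluates that bound for a given number of erased bits; the choice of 300 K and the accounting of one overwritten 32-bit float as 32 erased bits are illustrative assumptions, not figures from the paper.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact in the 2019 SI)


def landauer_bound_joules(bits_erased, temperature_kelvin=300.0):
    """Minimum energy in joules to irreversibly erase the given number of bits."""
    return bits_erased * K_B * temperature_kelvin * math.log(2)


# Assumed accounting: overwriting one 32-bit float erases 32 bits.
energy_per_bit = landauer_bound_joules(1)
energy_per_float32 = landauer_bound_joules(32)
```

At 300 K the bound is roughly 2.87e-21 J per bit, many orders of magnitude below what current hardware dissipates per operation.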
Tweedie's Formulae and Diffusion Generative Models Beyond Gaussian
Tang, Wenpin, Touzi, Nizar, Zhang, Zikun, Zhou, Xun Yu
Diffusion models have achieved remarkable success in generating samples from unknown data distributions. Most popular stochastic differential equation-based diffusion models perturb the target distribution by adding Gaussian noise, transforming it into a simple prior, and then use denoising score matching, a consequence of Tweedie's formula, to learn the score function and generate clean samples from noise. However, non-Gaussian diffusion models with state-dependent diffusion coefficients have been largely underexplored, as have the corresponding Tweedie's formulae. In this work, we extend Tweedie's formula to important non-Gaussian processes, including geometric Brownian motion (GBM), squared Bessel (BESQ) processes, and Cox-Ingersoll-Ross (CIR) processes, thereby yielding the corresponding denoising score-matching objectives. We then apply the derived formulae to image and financial time series generation using GBM- and CIR-based diffusion models, and to empirical Bayes estimation under the BESQ setting. The reported experimental results demonstrate the potential of non-Gaussian models. Key words: Bessel processes, denoising score matching, diffusion models, empirical Bayes, financial time series, geometric Brownian motion, Tweedie's formula.
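As background for the classical Gaussian case that the paper generalises: if $y = x + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$, Tweedie's formula gives $E[x \mid y] = y + \sigma^2 \, \frac{d}{dy} \log p(y)$. The sketch below checks this against the conjugate posterior mean for an assumed Gaussian prior $x \sim N(0, s^2)$; it does not cover the GBM, BESQ, or CIR extensions, and the function names are hypothetical.

```python
def gaussian_marginal_score(y, prior_var, noise_var):
    """Score d/dy log p(y) of the marginal y ~ N(0, prior_var + noise_var)."""
    return -y / (prior_var + noise_var)


def tweedie_posterior_mean(y, noise_var, score):
    """Classical Tweedie's formula for additive Gaussian noise."""
    return y + noise_var * score


def conjugate_posterior_mean(y, prior_var, noise_var):
    """Closed-form posterior mean for a zero-mean Gaussian prior."""
    return y * prior_var / (prior_var + noise_var)


# Assumed toy values: prior variance 4, noise variance 1, observation 2.5.
y, prior_var, noise_var = 2.5, 4.0, 1.0
score = gaussian_marginal_score(y, prior_var, noise_var)
denoised = tweedie_posterior_mean(y, noise_var, score)
reference = conjugate_posterior_mean(y, prior_var, noise_var)
```

The two routes agree here (both give 2.0), which is the identity that denoising score matching exploits: a learned score of the noisy marginal is enough to denoise.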