AITopics

2605.21292

Country: North America > United States (0.45)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.69)

Khoury, Fares El, Zenati, Houssam, Kallus, Nathan, Arbel, Michael, Bibaut, Aurélien

Semiparametric Efficient Bilevel Gradient Estimation

arXiv.org Machine Learning, May-21-2026

Bilevel optimization provides a natural framework for problems in which one learning task is constrained by the solution of another. This hierarchical structure appears across machine learning, including hyperparameter optimization [43, 39, 36], meta-learning [20, 18, 45], inverse problems and optimal control [31, 1], reinforcement learning [25], domain adaptation [35], and instrumental variable regression [42, 50, 49]. In these applications, the outer parameter is typically updated using gradient-based methods, so the quality of the resulting bilevel gradient directly affects both optimization and statistical performance. Most existing theory for bilevel optimization has been developed in finite-dimensional parametric settings, often under strong convexity of the lower-level problem [21, 27, 29, 61]. This assumption gives a unique inner solution and makes implicit differentiation stable [43, 36]. It is also convenient for algorithmic convergence and stability analyses [9, 23, 40].

artificial intelligence, efficient influence function, machine learning, (11 more...)

2605.21341

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)

arXiv.org Machine Learning, May-21-2026

Neural Negative Binomial Regression for Weekly Seismicity Forecasting: Per-Cell Dispersion Estimation and Tail Risk Assessment

Igilik, Alim

Earthquake forecasting is a critical task for natural risk management, infrastructure resilience planning, and emergency response operations. For Central Asia, and the Tian Shan mountain system in particular, this problem carries heightened importance due to high tectonic activity, complex geodynamics, and pronounced spatiotemporal heterogeneity of seismic processes. In the applied setting, the goal is not a deterministic forecast of individual events, but a macroscopic forecast of seismicity intensity: estimating the expected number of earthquakes with magnitude M ≥ 3.0 on a spatial grid at a weekly horizon. Historically, count data forecasting in fixed spatiotemporal cells has been formulated within the Poisson framework. However, its key assumption--equality of the conditional mean and conditional variance--is systematically violated in real seismological data. Earthquakes exhibit pronounced clustering associated with swarm activity, foreshock-aftershock sequences, and episodes of anomalous activity, resulting in overdispersion in which the variance substantially exceeds the mean. Under these conditions, uncritical application of the Poisson distribution leads to biased uncertainty estimates and, consequently, to underestimation of the risk of extreme scenarios. Despite the widespread adoption of machine learning methods in seismological problems, a substantial portion of existing work remains methodologically vulnerable. On one hand, several approaches apply continuous regression loss functions and metrics (e.g., MSE), ignoring the

2605.21437

Country:

Asia (0.68)
North America > United States (0.46)

Genre: Research Report (1.00)

Industry:

Information Technology > Security & Privacy (0.70)
Energy > Oil & Gas (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Miyagawa, Taiki, Ebihara, Akinori F.

Accurate Evaluation of Quickest Changepoint Detectors via Non-parametric Survival Analysis

We propose non-parametric estimators for the average run length (ARL) and average detection delay (ADD) in quickest changepoint detection (QCD) under finite and irregular sequence lengths. Although ARL and ADD are widely used as optimality criteria in theoretical and simulation studies, their application to real-world datasets is hindered by limited and irregular sequence lengths. To address this issue, we propose non-parametric estimators for the ARL and ADD, termed KM-ARL and KM-ADD, by drawing an analogy between QCD and survival analysis to model detection probabilities under sequence truncation. We derive estimation bias bounds and prove that they are asymptotically unbiased unless extrapolation is required. Experiments on simulated and real-world datasets demonstrate their practical utility, enhancing robustness against limited and irregular sequence lengths, improving interpretability, and facilitating empirical, intuitive model selection. Our Python code is provided at https://github.com/TaikiMiyagawa/Kaplan-Meier-Average-Run-Length, offering ready-to-use implementations for practitioners.
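Illustrative note (not taken from the paper or its repository): the analogy with survival analysis can be sketched with the textbook Kaplan-Meier estimator. The assumption here is that each change-free sequence contributes either an alarm time (an observed "event") or its own length (a censored observation, because the sequence ended before any alarm). The function names `kaplan_meier` and `restricted_mean` are hypothetical, and the restricted mean up to a fixed horizon is used as a stand-in for a run-length estimate that avoids extrapolating beyond the observed lengths.

```python
def kaplan_meier(times, observed):
    """Textbook Kaplan-Meier survival curve.

    times:    alarm step, or the sequence length if no alarm was raised.
    observed: True if an alarm was raised, False if the sequence was
              truncated first (censored).
    Returns a list of (t, S(t)) pairs at each distinct alarm time.
    """
    event_times = sorted({t for t, o in zip(times, observed) if o})
    survival, curve = 1.0, []
    for t in event_times:
        # Sequences still running (no alarm, not yet truncated) just before t.
        at_risk = sum(1 for u in times if u >= t)
        events = sum(1 for u, o in zip(times, observed) if u == t and o)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


def restricted_mean(curve, horizon):
    """Area under the step survival curve on [0, horizon]."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in curve:
        if t >= horizon:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    area += prev_s * (horizon - prev_t)
    return area


# Hypothetical data: five sequences, two of them truncated before any alarm.
times = [2, 3, 3, 5, 4]
observed = [True, True, False, True, False]
curve = kaplan_meier(times, observed)
run_length = restricted_mean(curve, horizon=5)
```

Averaging only the sequences in which an alarm occurred would ignore the truncated ones and bias the estimate downward; the censoring-aware curve keeps them in the risk set until they end.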

artificial intelligence, machine learning, sequence, (12 more...)

2605.18798

Country:

North America > United States (0.92)
Asia (0.67)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.67)

Industry:

Health & Medicine (1.00)
Energy (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Saha, Aytijhya, Bates, Stephen, Shah, Devavrat

Causal Inference with Categorical Unobserved Confounder via Mixture Learning

Unobserved confounding is a fundamental challenge for estimating causal effects. To address unobserved confounding, recent literature has turned to two different approaches -- proxy variables and the use of multiple treatments. The first approach, commonly referred to as proximal causal inference, requires proxies to be assigned to specific asymmetric roles: treatment-inducing proxies (negative control exposures), variables that act as common causes of the treatment and outcome, and outcome-inducing proxies (negative control outcomes). In practice, however, identifying variables that satisfy these asymmetric roles can be difficult depending on the application domain. The second approach, commonly referred to as the "Deconfounder," deals with multiple conditionally independent treatments. There has been limited progress towards developing a consistent estimation method for this setting. As the primary contribution of this work, we establish that causal effects are identifiable in both settings when the unobserved confounder is categorical under suitable conditions. Our approach builds on a mixture learning perspective: we show that the underlying confounding structure can be recovered by identifying the corresponding mixture distribution. We propose an estimation procedure based on tensor decomposition, which allows consistent recovery of the latent structure and comes with non-asymptotic guarantees. Simulation studies and real data experiments demonstrate that the proposed method performs well even with limited data.

artificial intelligence, assumption, machine learning, (15 more...)

2605.19006

Genre: Research Report (1.00)

Industry: Health & Medicine > Health Care Technology (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (1.00)

Learning Interpretable Point-Based Clinical Risk Scores via Direct Optimization

Cui, Ying, Li, Albert M, Charu, Vivek, Hwang, Yeon-Mi, Hernandez-Boussard, Tina, Tian, Lu

Many clinical risk scores are deployed as additive rules with nonnegative integer points assigned to relevant binary predictive features. These integer weights not only make the score easier to use in practice but also promote sparsity in the resulting prediction model. Such risk scores are often derived by first fitting a regression model and then rounding the estimated coefficients to the nearest integer after appropriate scaling. This approach is computationally fast but does not guarantee optimality of the resulting score. Alternatively, one may search over all possible integer weights to directly optimize a value function by posing the problem as an integer programming task. However, the associated computational burden can be substantial, especially when the value function is nonconcave or even discontinuous. In this paper, we develop new machine learning algorithms that employ a flexible greedy optimization strategy to learn such additive scoring directly under explicit and sensible optimality objectives. We apply the proposed method to a large electronic health record (EHR) cohort in Epic Cosmos to construct an integer-weighted comorbidity score for measuring the risk of post-discharge mortality. We also conduct a simulation study to examine the finite-sample operating characteristics.
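Illustrative note (not the paper's algorithm): the conventional "fit, scale, round" baseline that the abstract contrasts with direct optimization can be sketched as follows. The coefficient values, feature names, and the helpers `round_to_points` and `score` are hypothetical, and the scaling rule (largest coefficient maps to `max_points`) is only one of several conventions in use.

```python
def round_to_points(coefficients, max_points=5):
    """Scale regression coefficients so the largest maps to max_points,
    round to the nearest integer, and clamp at zero (nonnegative points)."""
    largest = max(coefficients.values())
    if largest <= 0:
        return {name: 0 for name in coefficients}
    scale = max_points / largest
    return {name: max(0, round(beta * scale))
            for name, beta in coefficients.items()}


def score(points, patient):
    """Additive score: sum the points of every binary feature present."""
    return sum(p for name, p in points.items() if patient.get(name, 0))


# Hypothetical log-odds coefficients from a previously fitted regression.
coefficients = {"diabetes": 0.42, "heart_failure": 1.10,
                "ckd": 0.71, "copd": 0.08}
points = round_to_points(coefficients)
total = score(points, {"diabetes": 1, "ckd": 1})
```

In this toy case the smallest coefficient rounds to zero points, which shows how integer weights can induce sparsity. Nothing in the rounding step looks at the outcome data, which is why the resulting score carries no optimality guarantee.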

artificial intelligence, machine learning, predictor, (16 more...)

2605.19113

Country: North America (0.28)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Health & Medicine > Health Care Technology > Medical Record (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.49)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.46)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.46)

Dual-Channel Tensor Neural Networks: Finite-Sample Theory and Conformal Structure Selection

Chen, Elynn, Li, Jiayu, Zheng, Zheshi, Pei, Jian

Tensor-valued data arise naturally in neuroimaging, genomics, climate science, and spatiotemporal networks, where multilinear dependencies across modes carry information that is destroyed under vectorization. Existing approaches either impose a single low-rank structure, which can miss localized signal, or treat the tensor as a long vector, which discards its multiway geometry. We propose a Dual-Channel Tensor Neural Network (DC-TNN) that decomposes each tensor input into a low-rank core and a sparse refinement, and processes the two components through coupled neural channels. The framework is structure-agnostic and accommodates CP, Tucker, and tensor-train cores within a single architecture. For estimation, we establish non-asymptotic risk bounds for the DC-TNN estimator that decompose into network approximation, core estimation, and refinement-selection terms, and show that the effective dimension is determined jointly by the core rank and refinement sparsity rather than by the ambient tensor size. For inference, we develop a structure-aware conformal ROC procedure that calibrates within the core-refinement latent space and produces ROC and AUC confidence bands with finite-sample, distribution-free coverage. Building on this, we propose a conformal structure selector that, to our knowledge, is the first distribution-free procedure for choosing among candidate tensor decompositions with finite-sample validity. Simulations and an analysis of a protein dataset demonstrate competitive predictive accuracy, reliable uncertainty quantification, and consistent recovery of the tensor structure.

artificial intelligence, machine learning, tucker, (18 more...)

2605.19122

Genre: Research Report (1.00)

Industry:

Media > Television (0.59)
Health & Medicine > Therapeutic Area > Neurology (0.34)
Health & Medicine > Diagnostic Medicine > Imaging (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

D'Ambrosia, Samuel H., Daniels, Sultan M., DeWeese, Michael R., Sahai, Anant

The Thermodynamic Costs of Simple Linear Regression

The construction of models from data is a significant contributor to the energetic costs of computation. Because of this, understanding how foundational thermodynamic bounds apply to modeling algorithms will be increasingly important. Here, we study the thermodynamic costs of a basic and fundamental modeling algorithm: simple linear regression. Following Landauer, we approximate the thermodynamic lower bound on irreversibly performing both exact linear regression and linear regression via stochastic gradient descent as implemented on floating-point numbers. From this, we derive energy-cost-aware scaling laws for the optimal dataset size for training a linear regression model given a generalization-error-dependent demand for inference. Additionally, we discuss a method to lower bound the entropy production from the mismatch cost for algorithms with continuous input variables.
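Illustrative note (not the paper's accounting): Landauer's principle puts the minimum heat dissipated by irreversibly erasing one bit at k_B T ln 2. The sketch below only evaluates that standard bound. The function name `landauer_bound_joules`, the default temperature of 300 K, and the example of erasing a single 32-bit float are assumptions for illustration; how many bits a given regression procedure erases is what the paper itself analyzes and is not reproduced here.

```python
import math

BOLTZMANN = 1.380649e-23  # J/K, exact value in the SI


def landauer_bound_joules(bits_erased, temperature_kelvin=300.0):
    """Minimum dissipated energy for erasing `bits_erased` bits: n * k_B * T * ln 2."""
    return bits_erased * BOLTZMANN * temperature_kelvin * math.log(2)


per_bit = landauer_bound_joules(1)      # roughly 2.87e-21 J at 300 K
per_float32 = landauer_bound_joules(32)  # erasing one 32-bit register
```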

artificial intelligence, entropy, machine learning, (17 more...)

2605.19195

Country: North America > United States > California (0.28)

Genre:

Research Report (0.82)
Workflow (0.67)

Industry: Energy (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)

Wood, Kieran, Zohren, Stefan, Roberts, Stephen J.

DeRegiME: Deep Regime Mixtures for Probabilistic Forecasting under Distribution Shift

We introduce DeRegiME -- Deep Regime Mixture of Experts -- a direct multi-horizon probabilistic forecaster that separates latent uncertainty regimes from the underlying signal and softly assigns each forecast location to learned recurring regimes using a sparse variational Gaussian process (GP) whose nonstationary regime-mixing kernel and Student-t likelihood combine per-regime sub-kernels and noise processes via a shared gate. This yields a single sparse-GP posterior, not a mixture of GP experts. DeRegiME addresses a key limitation of neural forecasters: point forecasts discard residual uncertainty, and probabilistic heads -- whether single marginals, uninterpreted mixtures, quantile sets, or diffusion samples -- rarely expose the regime structure of the residual. Yet distribution shift in noisy heteroskedastic time series may be abrupt, gradual, or horizon-dependent and often appears in residual uncertainty rather than the conditional mean. DeRegiME yields an interpretable mean-residual-noise decomposition with a direct-sum feature-space representation that anchors regimes as clusters of residual similarity whose transitions surface as implicit changepoints. The effective number of regimes is pruned by the stick-breaking gate. We prove kernel validity and predictive-density propriety, and across ten benchmarks and three encoder grids DeRegiME improves negative log predictive density (NLPD) by 20.3% over the strongest encoder-matched baseline, a DeepAR/GluonTS-style dynamic Student-t head, with parallel gains on CRPS (3.0%) and MSE (4.7%). Improvements are consistent across all datasets, which span abrupt, gradual, and seasonal shifts.

artificial intelligence, machine learning, regime, (17 more...)

2605.19231

Genre: Research Report (0.50)

Industry: Banking & Finance > Trading (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)

Factor Augmented High-Dimensional SGD

Li, Shubo, Han, Yuefeng, Yu, Xiufan

Stochastic gradient descent (SGD) has been a cornerstone of machine learning since the pioneering work of Robbins & Monro (1951). Beyond its algorithmic simplicity and scalability, SGD has also become a central object of theoretical study, with refined analyses linking its dynamics to implicit regularization, generalization performance, and algorithmic stability. For decades, theoretical analyses of SGD have largely resided within the realm of classical stochastic approximation (Polyak & Juditsky, 1992; Lai, 2003; Bottou et al., 2018), where the data dimension is considered fixed while the sample size tends to infinity. While this regime has yielded foundational insights, it no longer fully reflects the characteristics of modern learning systems. Contemporary applications often operate in regimes where data dimension, sample size, and model complexity grow together, calling for new theoretical tools and perspectives that go beyond traditional asymptotic analyses. In this study, we focus on learning tasks involving high-dimensional predictors. When SGD is applied directly to such data, the dimensionality of the feature space propagates into the optimization process, resulting in a high-dimensional (HD) parameter space. Algorithmically, one trending strategy is to approximate the gradient updates using a low-rank representation to reduce memory costs and accelerate computation (Wang et al., 2018; Vogels et al., 2019; Kozak et al., 2019; Kasiviswanathan, 2021; Zhao et al., 2024). Theoretically, despite the vast literature on SGD, convergence guarantees of HD-SGD remain limited (Garrigos & Gower, 2023; Li et al., 2025).

artificial intelligence, factor model, machine learning, (16 more...)

2605.19291

Country: North America > United States (0.46)

Genre: Research Report > New Finding (0.88)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.90)