AITopics

2605.0527

Country:

North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.27)
South America > Brazil > Ceará (0.25)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Therapeutic Area > Oncology (1.00)

Rudoler, Joseph H., Tan, Kevin, Hooker, Giles, Kording, Konrad P.

Estimating Implicit Regularization in Deep Learning

Deep learning systems are known to exhibit implicit regularization (alt. implicit bias), favoring simple solutions instead of merely minimizing the loss function. In some cases, we can analytically derive the implicit regularization -- connecting it to an equivalent penalty that augments the learning objective. However, modern deep learning systems are complex, carrying modifications to the training procedure and architecture (e.g. early stopping, minibatching, dropout) whose effects are not always directly interpretable. Although estimating the resulting implicit regularization could aid theorists in algorithm design and practitioners in interpreting their hyperparameter choices, this problem has received little direct attention. It is also tractable: regularization makes weight updates deviate from loss gradients, promising a signal for identifying implicit bias. Here we provide gradient matching methods that can be used to empirically estimate the implicit regularization. Our method works on networks with known regularization, recovering popular explicit penalties like $\ell_1$ and $\ell_2$. It also replicates known implicit effects, like the quadratic weight penalty induced by early stopping in gradient descent, demonstrating that it can be used to test theories of implicit regularization. Crucially, because our method is empirical, it can handle implicit regularization in arbitrary networks. We demonstrate this use by characterizing the effects of dropout in deep networks, showing implicit $\ell_2$ effects in this popular method. Our work shows that practitioners can use gradient matching to understand regularization in networks with implicit biases that are too complicated to derive analytically.

artificial intelligence, machine learning, regularization, (17 more...)

2605.05436

Country: North America > United States > Pennsylvania (0.40)

Genre: Research Report (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Cui, Chengyu, Xu, Gongjun

Convexity in Disguise: A Theoretical Framework for Nonconvex Low-Rank Matrix Estimation

Nonconvex methods have emerged as a dominant approach for low-rank matrix estimation, a problem that arises widely in machine learning and AI for learning and representing high-dimensional data. Existing analyses for these methods often require additional regularization to mitigate nonconvexity, even though such regularization is often unnecessary in practice. Moreover, most analyses rely on problem-specific arguments that are difficult to generalize to more complex settings. In this paper, we develop a theoretical framework for studying nonconvex procedures across a broad class of low-rank matrix estimation problems. Rather than focusing on a specific model, we reveal a fundamental mechanism that explains why nonconvex procedures can behave well in low-rank estimation. Our key device is a {\it benign regularizer} that does not alter the original update rule, but yields an equivalent locally strongly convex formulation of the algorithm. This perspective uncovers a disguised convexity inherent in the nonconvex procedure and provides a new route to theoretical guarantees for nonconvex low-rank matrix estimation.

artificial intelligence, contraction, machine learning, (17 more...)

2605.05446

Country: North America > United States > Michigan (0.40)

Genre:

Workflow (0.69)
Research Report (0.63)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Bayesian Rain Field Reconstruction using Commercial Microwave Links and Diffusion Model Priors

Moufad, Badr, Ilina, Albina, Habi, Hai Victor, Lahlou, Salem, Janati, Yazid, Messer, Hagit, Moulines, Eric

Commercial Microwave Links (CMLs) offer dense spatial coverage for rainfall sensing but produce path-integrated measurements that make accurate ground-level reconstruction challenging. Existing methods typically oversimplify CMLs as point sensors and neglect line integration relating rainfall to signal attenuation, resulting in degraded performance under heterogeneous precipitation. In this work, we view rain field reconstruction as a Bayesian inverse problem with Diffusion Models (DMs) as high-fidelity spatial priors. We show that diffusion models better preserve key rainfall statistics compared to censored Gaussian processes. Framing rainfall estimation as a Bayesian inverse problem with a DM prior enables training-free posterior sampling using a broad family of methods, including Plug-and-Play, Sequential Monte Carlo, and Replica Exchange methods. Experiments on synthetic and real-world datasets demonstrate consistent improvements over established CML-based reconstruction baselines.

artificial intelligence, bayesian rain field reconstruction, machine learning, (12 more...)

2605.0552

Country:

Europe (0.93)
North America > United States (0.46)
Asia > Middle East (0.28)

Genre: Research Report > New Finding (0.92)

Industry:

Energy (0.46)
Law > Civil Rights & Constitutional Law (0.35)

Permutation-preserving Functions and Neural Vecchia Covariance Kernels

Cao, Jian, Liu, Nian, Lin, Ying

We introduce a novel framework for constructing scalable and flexible covariance kernels for Gaussian processes (GPs) by directly learning the covariance structure under a regression-type parameterization induced by Vecchia approximations, using deep neural architectures. Specifically, we model kriging coefficients and conditional standard deviations, deterministic quantities that uniquely characterize the covariance, providing stable and informative learning targets. Exploiting the permutation-equivariant structure of conditioning sets in the Vecchia factorization, we derive a universal representation for permutation-preserving functions and design neural architectures that respect this symmetry, leading to improved training stability and data efficiency. The proposed approach enables expressive, non-stationary kernel learning while maintaining computational scalability, thereby bridging classical GP methodology with modern deep learning.

artificial intelligence, deep learning, machine learning, (16 more...)

2605.05523

Country: North America > United States > Texas > Harris County > Houston (0.14)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Oh, Sunmin, Oh, Sang-Yun, Park, Gunwoong

Relaxed Sparsest-Permutation Formulation for Causal Discovery at Scale

Despite the growing availability of large datasets, causal structure learning remains computationally prohibitive at scale. We revisit sparsest-permutation learning for linear structural equation models and show that exact Cholesky factorization is unnecessary for structure recovery. This observation motivates a support-level relaxation that searches for sparse triangular factors over a precision-support screening graph. The relaxed formulation can be efficiently evaluated via masked zero-fill incomplete Cholesky factorization, enabling scalable comparison of candidate orderings. At the population level, we establish soundness for Markov equivalence class (MEC) recovery under no-cancellation and sparsest Markov representation assumptions, as well as robustness to ordering misspecification. Motivated by these guarantees, we introduce SCOPE, a sparse-Cholesky pipeline that provides a scalable implementation of the relaxed formulation. Experiments on synthetic and real datasets demonstrate that SCOPE matches the MEC recovery accuracy of substantially slower baselines, while achieving significantly reduced runtime and scaling to 10k variables.

artificial intelligence, machine learning, optimization problem, (17 more...)

2605.05568

Country: North America > United States > California (0.28)

Genre: Research Report (0.64)

Industry:

Health & Medicine > Therapeutic Area > Oncology (0.68)
Health & Medicine > Pharmaceuticals & Biotechnology (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.66)

Xu, Jianyu, Wang, Yu-Xiang

Optimal Contextual Pricing under Agnostic Non-Lipschitz Demand

We study contextual dynamic pricing with linear valuations and bounded-support agnostic noise, whose induced demand curve may be non-Lipschitz with arbitrary jumps and atoms. Such discontinuities break the cross-context interpolation arguments used by smooth-demand pricing algorithms, while the best previous method achieved only $\tilde O(T^{3/4})$ regret. We propose Conservative-Markdown Redirect-UCB Pricing, a polynomial-time algorithm that combines randomized parameter estimation, conservative residual-grid probing, and confidence-based one-step redirection. Our algorithm achieves $\tilde O(T^{2/3})$ optimal regret, matching the known lower bounds of Kleinberg and Leighton (2003) up to logarithmic factors and improving over the previous upper bound of Xu and Wang (2022). Under stochastic well-conditioned contexts, this closes the long-existing open regret gap in linear-valuation contextual pricing under agnostic non-Lipschitz noise distribution.

artificial intelligence, machine learning, pricing, (16 more...)

2605.05609

Country: North America > United States > California (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.88)

Liu, Andy Zeyi, Paquette, Elliot, Sous, John

Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization

Training loss and throughput can hide distinct internal representation in language-model training. To examine these hidden mechanics, we use spectral measurements as practical and operational diagnostics. Using a controlled family of decoder-only models adapted from the modded NanoGPT codebase, we introduce an empirical protocol based on activation covariance and per-sample gradient SVD spectra. This dual-view reveals three empirical findings and one mechanistic explanation. First, batch size acts as a latent determinant of representation geometry: runs that reach equal loss settle into systematically distinct activation spectra. Second, the activation covariance tail measured early in training reliably forecasts downstream token efficiency. Third, movement of the activation spectrum head (leading modes), together with gradient spectra, characterizes underlying learning-dynamics changes, separating learning-side architectural improvements from primarily execution-side gains. These predictive and diagnostic signals persist across the 12-, 36-, and 48-layer model tiers. Finally, a mechanistic model proves the main observations and explains how activation covariance spectra correlate with task-aligned feature learning.

large language model, machine learning, natural language, (19 more...)

2605.05683

Country: North America > United States > Connecticut > New Haven County (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Convex-Geometric Error Bounds for Positive-Weight Kernel Quadrature

Hayakawa, Satoshi

Kernel quadrature (KQ) is a kernel-based approach to numerical integration, closely related to Bayesian quadrature (BQ) and probabilistic integration [38, 39, 10]. For sufficiently regular integrands, KQ can exploit spectral structure in a reproducing kernel Hilbert space (RKHS) that is invisible to plain Monte Carlo and thereby converge faster than the usual O(N 1/2) rate in the number of points [3, 28]. Unconstrained kernel-based rules, however, may produce numerically unstable weights, motivating longstanding interest in positively weighted rules [13, 21, 29, 46]. In this paper, positive weights mean nonnegative weights that sum to one, i.e., simplex or convex-combination weights. Whether positive-weight KQ can systematically improve over Monte Carlo is a subtle question. Kernel herding and related constructions suggested fast rates under favorable assumptions [13], but the conditional-gradient viewpoint of Bach et al. [4] clarified that the strongest such assumptions are not generally available in infinite-dimensional RKHSs. Subsequent herding-type analyses in broad RKHS settings have therefore mostly remained at the Monte-Carlo scale, except under additional structure or modified algorithms such as sparse herding variants [31, 44, 43]. Beyond herding, subsampling-based positive KQ methods such as thinning [16, 15] and recombination [21, 24] have obtained rates beyond Monte Carlo, but a general mechanism for such improvement in the simple i.i.d.

approximation, artificial intelligence, machine learning, (16 more...)

2605.05705

Country: Asia > Japan > Honshū > Kantō (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Liang, Haodong, Lai, Lifeng

Transformers Provably Implement In-Context Reinforcement Learning with Policy Improvement

We investigate the ability of transformers to perform in-context reinforcement learning (ICRL), where a model must infer and execute learning algorithms from trajectory data without parameter updates. We show that a linear self-attention transformer block can provably implement policy-improvement methods, including semi-gradient SARSA and actor-critic, via explicit parameter constructions. Beyond existence, we design a teacher-mimicking training procedure, analyze its gradient-flow dynamics, and establish the first convergence guarantee in the ICRL literature: under suitable richness conditions on the training MDP distribution, gradient flow converges locally and exponentially to an optimal parameter manifold corresponding to the desired RL update. Empirically, training transformers on randomly generated tabular MDPs confirms these predictions: the learned models recover the parameter structure of our explicit constructions and, when deployed on unseen MDPs, deliver strong in-context control performance. Together, these results illuminate how transformer architectures internalize and execute classical reinforcement learning algorithms in context, bridging mechanistic understanding and training dynamics in ICRL.

artificial intelligence, machine learning, reinforcement learning, (19 more...)

2605.05755

Country: North America > United States > California (0.28)

Genre: Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)