Genre
Optimal Rates for Generalization of Gradient Descent Methods with Deep Neural Networks
Zhou, Junyu, Wang, Puyu, Lei, Yunwen, Ying, Yiming, Zhou, Ding-Xuan
Recent progress has been made in understanding the statistical generalization performance of gradient descent methods for overparameterized neural networks within the neural tangent kernel (NTK) regime. However, most of the existing work on regression problems is limited to shallow network architectures, leaving a notable gap in the theory of deep neural networks. This paper addresses this gap by presenting a comprehensive generalization analysis for deep ReLU networks trained using gradient descent (GD) and stochastic gradient descent (SGD). Specifically, we establish the first known minimax-optimal rates of excess population risk for both GD and SGD with deep ReLU networks, under the assumption that the network width scales polynomially with respect to the network depth and training sample size. Our results demonstrate that with sufficient width, gradient descent methods for deep ReLU networks can achieve optimal generalization rates on par with kernel methods.
Automatic, Debiased, and Invariant Counterfactual Generation under General Interventions
Kim, Raphael C, Zhu, Jingsen, Zabih, Ramin, Santacatterina, Michele
Decision-making in complex systems often requires understanding counterfactuals of general, potentially highdimensional, interventions with limited data. Collecting sufficient data for every counterfactual in complex systems may be near impossible due to cost or ethical reasons. With the recent growth in expressivity and power in generative modeling, generative models that can synthesize counterfactual outcomes under generalized interventions stand as a viable solution for supporting robust decision-making in real-world systems. In an ideal world, we may simply train a generative model with the data we have, and sample from the generator under the intervention of interest. Counterfactual generative modeling may fail with such an approach due to confounding bias. Correlations observed in the sampled data may be mistaken for true causal effects, yielding incorrect downstream decisions. For example, generating medical images under changes in intervention dose can help track disease progression and identify optimal dosing strategies. However, if the training data primarily consisted of those who were responsive to intervention (e.g., younger populations), then the generator would identify the ranges in the data as effective even if this does not hold for different populations (e.g.
Principal Component Analysis for Multivariate Extremes
Cooley, Dan, Sabourin, Anne, Wixson, Troy
Background on Principal Component Analysis Principal component analysis (PCA) is a method widely used by practitioners for learning features of high-dimensional data [15]. It is a dimension reduction technique that represents the data in lower dimensions, often with the aim of exploratory analysis or visualization. PCA can also be used as a data preprocessing step, for instance in regression analysis. While PCA is familiar and commonplace for understanding behavior in the data's'bulk', only recently have similar methods been proposed for understanding high-dimensional extremes. The aim of this chapter is to review and compare recent approaches for extremal PCA. 1
Finding Most Influential Sets
Konrad, Lucas D., Kuschnig, Nikolas
Identifying most influential sets (MIS) - size-$k$ subsets whose removal maximally changes a target estimand - is typically infeasible because it requires searching over $\binom{n}{k}$ subsets. For estimands with linear-fractional leave-set-out effects, we show that MIS selection reduces to a one-parameter sequence of top-$k$ problems. Dinkelbach's method yields an algorithm with $\mathcal{O}(n)$ cost per iteration and finite termination. For fixed residualized inputs, the algorithm returns a globally optimal set for the univariate ratio objective, including the oracle-residualized partial linear model. With estimated nuisance functions, uniform denominator and generated-score stability imply approximation to the first-order oracle orthogonal-score objective; exact set recovery follows under a separation condition. Simulations and applications show that the method recovers exact MIS that were previously computationally inaccessible.
The Download: AI hacking beyond Mythos, and chatbots' impact on our brains
Plus: Anthropic has called for a global slowdown in AI development. The Meta hack shows there's more to AI security than Mythos On Monday, reports emerged that attackers had used Meta's AI customer support agent to steal Instagram accounts. Their approach was simple: they asked the agent to link the accounts to email addresses they controlled, and it complied. Since Anthropic announced that its Mythos model was too good at hacking for a general release, cybersecurity concerns have focused on the risk of superpowered AI systems overwhelming computer infrastructure. But the Instagram hack shows that far simpler exploits can still cause damage. As companies offload more work to AI, these comparatively unsophisticated attacks are becoming harder to ignore.
Dead Directions: Geometric Singular Learning
Singular learning theory and information geometry have studied the same parameter spaces in mostly separate vocabularies: the former computes Bayesian invariants in resolved coordinates, the latter works in original coordinates under a non-degeneracy assumption that overparameterised models routinely violate. We bridge them through one primitive, the dead direction: a unit vector along which the Fisher metric degenerates, equivalently a tangent to the analytic singular set with a definite KL order, set by how fast the KL divergence vanishes. The two readings name the same vector; our central move shows its KL order is recoverable as the decay rate of the directional Fisher curvature approaching the singularity, in original parameter coordinates and without a Hironaka resolution. A selection rule on smooth fibres translates this rate into Watanabe's single-direction contribution to the real log canonical threshold, and we extend the recovery to multi-component crossings, multiplicity $m$, the singular fluctuation $ฮฝ$ (universal in the KL order for 1D directions), prior-RLCT shifts, and tempered posteriors. We then lift this rate to a deep network: a multi-layer K-FAC factorisation writes each Fisher block as a product of activation- and gradient-side rates with a duality between them, instantiated at modern-network primitives (residual streams, layer normalisation, attention). A quotient theorem carries the rate to the gauge quotient $ฮ/G$ under gradient flow on a $G$-invariant metric; SGD qualifies, standard Adam does not, and we construct a $G$-equivariant Adam-family preconditioner (DDCAdam) that does. The bridge yields a parameter-coordinate handle on singular geometry, closed-form per-architecture predictions, and a trajectory-rate readout of Watanabe's triple $(ฮป, m, ฮฝ)$ from one checkpoint's forward and backward passes, without posterior sampling.
Multimarginal flow matching with optimal transport potentials
Kansal, Raghav, Crair, David, Nguyen, Nghia, Pope, Scott, Parry, Bradley
Flow matching (FM) has emerged as a powerful framework for learning dynamic transport maps between two empirical distributions. However, less explored is the setting with intermediate observed marginals that can help constrain the flows between the endpoints. This "multimarginal" regime is central to modeling temporal evolution in dynamical systems in many scientific domains that can sample sequential distributions. We tackle this problem with a novel approach that leverages the connection between FM and dynamic optimal transport (OT), softly steering the flow towards the intermediate marginals through potential terms in the dynamic OT action. By extending the conditional FM learning target to incorporate these potentials, we derive an efficient, simulation-free algorithm for multimarginal FM that offers considerable flexibility in the spatiotemporal dynamics of the learned flows. We demonstrate state-of-the-art performance and training efficiency of OT-potential FM (OTP-FM) on diverse single-cell RNA sequencing, oceanographic, and meteorological datasets. Our code is available at https://github.com/Bexorg-Inc/OTP-FM.
Adaptive Learning Rates with Surrogate Probability for Follow-the-Perturbed-Leader
Lee, Jongyeong, Honda, Junya, Ito, Shinji, Kim, Chansoo
Follow-the-regularized-leader framework has shown effectiveness and flexibility in online learning problems, where the choice of learning rates are known to be crucial. Recently, adaptive learning rates defined in terms of the arm-selection probabilities, obtained by solving convex optimization, have achieved improved best-of-both-worlds (BOBW) guarantees in various bandit problems. In contrast, BOBW guarantees for its computationally efficient alternative, follow-the-perturbed-leader (FTPL), remain relatively limited since its optimization-free nature ironically makes the design of adaptive, probability-dependent learning rates non-trivial. To address this challenge, we propose an adaptive learning rate for FTPL by introducing surrogate probability functions that can be computed only from the available quantities, without requiring the exact probabilities. Based on these learning rates with surrogate functions, we provide the BOBW guarantee for FTPL with Pareto perturbations for any shape parameter $ฮฑ>1$, generalizing prior results restricted to specific choices of $ฮฑ=2$. We further show the BOBW guarantees for FTPL with adaptive learning rates in the bandit problem with expert advices. Our approach preserves the computational simplicity of FTPL while enabling probability-dependent adaptivity, and the surrogate-based methodology may be of independent interest in other algorithmic frameworks beyond FTPL and learning rate designs.
Efficient Mean Curvature Computation on High-Dimensional Data Manifolds
Estimating local mean curvature at each point of a high-dimensional dataset is a key ingredient of geometry-aware machine learning algorithms, such as the Mean Curvature Boundary Points (MCBP) method. The naive implementation of this computation, based on a local shape operator approximated from k-nearest neighbor patches, involves an explicit construction of a matrix $H$ whose trace form yields an $O(m^4)$ cost per point, rendering the approach intractable for datasets with more than a few dozen features. This paper introduces two complementary contributions that together reduce this cost by several orders of magnitude. The first contribution is an exact algebraic identity. This identity, derived from the orthogonality of the eigenvectors of the covariance matrix and the cyclicity of the trace operator, eliminates $H$ entirely and reduces the per-point cost to $O(m^2)$ after the eigendecomposition. The second contribution addresses the remaining $O(m^3)$ bottleneck of the full eigendecomposition. Since the local covariance matrix has rank at most $k-1 \ll m$, we replace it with a truncated SVD of the $k \times m$ centered data matrix, an $O(k^2 m)$ operation, and derive an analytical approximation for the contribution of the null-space eigenvectors based on the expected value of their outer product under the Haar measure. The resulting estimator has total cost $O(k^2 m + k m p^2)$, where $p = k-1$. Experiments on real-world datasets confirm speedups of 50 to 300 times relative to the original implementation, with negligible loss when the fast estimator is used to replace the original version. By providing a scalable and data-driven estimate of local curvature, the proposed method establishes curvature as a practical geometric feature for a broad range of machine learning tasks, from classical to modern deep learning pipelines.
A Two-Channel F-Transform Representation for Early Trajectory Characterization in Iterated Correlation Dynamics
Many nonlinear iterative procedures generate high-dimensional trajectories whose early behavior is informative but difficult to compare directly. This paper studies a soft-computing representation problem: how to convert a short early trajectory segment into compact, interpretable, fixed-dimensional fuzzy coordinates that preserve information about subsequent convergence and trajectory geometry. The problem is investigated for iterated Pearson correlation matrices, a nonlinear matrix iteration historically connected with CONCOR-type blockmodeling and repeated-correlation methods. The proposed descriptor uses two logarithmic signals from the early post-transient regime: a step-size signal, measuring contraction magnitude, and a contraction-ratio signal, measuring local contraction evolution. Each signal is projected onto a three-node triangular fuzzy partition using zero-degree F-transform coefficients and one centered first-degree coefficient. This yields an eight-dimensional two-channel representation separating local level from local trend and contraction magnitude from contraction evolution. Across 22 matrix dimensions with 1000 trajectories per dimension, the descriptor is compared with raw trajectory samples, statistical summaries, and PCA-compressed raw features using Random Forest regression for convergence-length approximation. It achieves mean R^2 = 0.6480, close to raw trajectories (0.6518) and statistical summaries (0.6528), while improving over the step-size-only F-transform descriptor (0.5001). Repeated random-split and shifted-window experiments confirm stability. PCA and clustering further show reproducible low-dimensional organization, with the first two principal components explaining 84.26% of variance and k = 3 favored by the mean silhouette criterion.