Goto

Collaborating Authors

 argument


The Biggest Tell That Something Was Written by AI

The Atlantic - Technology

Look closely and you'll see that every part of the text is not quite right. A few weeks ago, where I live in Johannesburg, a man ran a stop sign and crashed into my Subaru. At the scene he was frantic, unable to gather his thoughts. Half an hour later, I received a lengthy, perfectly grammatical text from him elegantly explaining how he perceived the crash had happened. For a repair quote, I wrote to a mechanic I know, a man who used to text me in curt phrases riddled with shorthand.


Mildly Overparameterized ReLU Networks on Orthogonal Data: Incremental Learning and Implicit Bias

arXiv.org Machine Learning

The successful training of neural networks hinges on the use of first order optimization methods, yet the theoretical characterization of these methods remains incomplete. This is especially true in settings with mild overparameterization. In this work, we study the gradient flow dynamics of two-layer ReLU networks from small initialization with orthogonal training data. We prove the limiting flow converges to a saddle-to-saddle jump process as the initialization scale tends to zero, revealing an incremental learning phenomenon in which a new neuron activates at each saddle. This analysis recovers the known result of Dana et al. (2025, arXiv:2502.16977) that the network interpolates the training data with high probability as soon as $m \gtrsim \log(n)$, where $m$ is the network width and $n$ is the number of training samples. This incremental process characterization also allows us to derive a novel implicit bias result: the learned interpolator has a squared $\ell_2$-norm scaling as $\sqrt{n}$, which is within a constant factor of the minimal $\ell_2$-norm interpolator. More broadly, our work provides the first rigorous proof of an incremental learning process for ReLU networks, whilst suggesting mildly overparameterized networks can converge to interpolating solutions whose complexity is of the same order as that of the optimal interpolator.


Optimal Dimension-Free Sampling for Regularized Classification

arXiv.org Machine Learning

We prove optimal sampling bounds achieving $(1\pm\varepsilon)$-relative error for a broad class of Lipschitz continuous classification loss functions under various regularization terms. This includes important functions such as logistic and sigmoid loss, hinge loss, and ReLU loss, as prominent and popular representative examples. In particular, we prove $k^2/\varepsilon^2$ upper and lower bounds for $\|\cdot\|_2/k$ regularization, and $k/\varepsilon^2$ upper and lower bounds for $\|\cdot\|_1/k$ regularization. For $\|\cdot\|_2^2/k$ regularization, the sampling complexity depends mainly on a bounded derivative property: if $|g'(x)|\leq g(x)$, and $g(0)>0$, and $g$ is monotonic or convex, then it admits linear in $k$ sampling complexity; otherwise the general bound is $k^2/\varepsilon^2$. However, if $g(0)=0$, our results indicate that no dimension-free bounds are possible, and even sublinear bounds are ruled out. All upper bounds are complemented by matching lower bounds up to polylogarithmic terms. Moreover, our work relies conceptually and algorithmically on simple uniform or (squared) norm sampling and hereby improves over recent cubic $k^3/\varepsilon^2$ sensitivity sampling bounds of (Alishahi and Phillips, ICML'24). This is achieved by refined arguments involving higher moment bounds and empirical process analyses to avoid overcounting that appears in the de-facto standard VC-dimension and sensitivity framework.


Robust Statistical Estimators with Bounded Empirical Sensitivity

arXiv.org Machine Learning

We introduce a new measure of robustness for statistical estimators, which we call \emph{empirical sensitivity}. An estimator $\hat ฮธ$ has bounded empirical sensitivity if, with high probability over a dataset $X = (X_1, \dots, X_n) \sim \mathcal{D}^{\otimes n}$, for any dataset $Y$ obtained by modifying at most $ฮทn$ points in $X$, we have that $\hat ฮธ(Y)$ is close to $\hat ฮธ(X)$. We study bounds on this quantity for the prototypical problem of Gaussian mean estimation. We prove new lower bounds, showing that for any estimator $\hat ฮผ$ which achieves an optimal $\ell_2$-error bound of $O\left(\sqrt{d/n}\right)$, the empirical sensitivity is at least $ฮฉ\left(ฮท+ \sqrt{ฮทd/n}\right)$. The two terms arise due to obstructions on the mean and variance (via an Efron-Stein argument) of such an estimator. We show that this bound is tight up to logarithmic factors, by employing recent results for robust empirical mean estimation.


Uniform-in-Time Weak Propagation-of-Chaos in Shallow Neural Networks

arXiv.org Machine Learning

We consider one-hidden layer neural networks trained in the feature-learning regime using gradient descent, and relate the output of the finite-width network $f_{\hatฯ_t^m}$ to its infinite-width counterpart $f_{ฯ_t^{MF}}$, which evolves in the mean-field dynamics. While constant-time horizon bounds for $\|f_{ฯ_t^{MF}} - f_{\hatฯ_t^m}\|$ may be obtained via standard Grรถnwall estimates, the long-time behavior of the fluctuation is a more delicate matter. Uniform-in-time bounds often rely on (local) strong convexity in the landscape or Logarithmic Sobolev inequalities present in noisy gradient dynamics. In this work, we establish non-asymptotic weak propagation-of-chaos that holds uniformly in time, obtained by exploiting instead the convergence rate of the mean-field deterministic Wasserstein-gradient-flow dynamics. Specifically, denoting by $L_t$ the mean-field excess MSE loss at time $t$ and $m$ the number of neurons, under standard regularity assumptions and the condition $\int_0^\infty L_t^{1/2} dt =O(\log d)$, we obtain the uniform in time bound $\|f_{ฯ_t^{MF}}- f_{\hatฯ_t^m}\|^2 \lesssim \text{poly}(d) m^{-\min(1,c/6)}$ whenever $L_t \lesssim t^{-c}$. Our result holds in a noiseless setting and does not make any assumptions on the geometry of the landscape near the optimum, and extends seamlessly to other forms of discretization, including finite number of samples and time discretization. A key takeaway of our result is that whenever the convergence rate of the mean-field, population-loss dynamics is faster than $t^{-2}$, we can attain a loss of $ฮต$ with only $\text{poly}(d/ฮต)$ neurons, training samples, and GD steps.


Air France and Airbus found guilty of manslaughter over 2009 plane crash

BBC News

Air France and Airbus have been found guilty of manslaughter over a 2009 plane crash which killed 228 people. The Paris Appeals Court found the airline and aircraft manufacturer guilty of corporate manslaughter over the incident, in which flight AF447 between Rio de Janeiro and Paris crashed into the Atlantic Ocean. The passenger jet stalled during a storm and plunged into the water, killing all on board. A court had previously cleared the companies in April 2023 but they were found guilty after this appeal. The Airbus A330 vanished from radars during a storm, with its wreckage found after a long search of 10,000 sq km (3,860 sq miles) of sea floor.


Optimizing Computational-Statistical Runtime for Wasserstein Distance Estimation

arXiv.org Machine Learning

Squared Wasserstein distance is a frequently used tool to measure discrepancy between probability distributions. This distance is typically computed between empirical measures of size $n$ from two underlying random samples. Unfortunately, even in lower dimensional Euclidean space problems $\left( d \in \{2,3\} \right)$, algorithms for Wasserstein distance computation with approximate or exact precision guarantees scale poorly in the runtime as a function of $n$ and the desired precision. In response, we consider the computational-statistical runtime, where the goal is to estimate from samples the Wasserstein distance between potentially smooth measures up to $ฮต$-additive error in expectation with respect to the sampling; we allow $O(1)$ computational cost for collecting a sample. Towards this, we develop a Sample-Sketch-Solve paradigm where we introduce a regular cartesian grid sketch of the samples. We show that (especially under $ฮฑ$-Hรถlder smooth distributions) this can compress the data without increasing asymptotic error, and also regularizes the structure which enables faster exact algorithms. Ultimately, we approximate $W_2^2(P,Q)$ within $ฮต$ error in $ฮต^{-\max(2,\frac{d+1+o(1)}{1+ฮฑ})}$ time for $0 < ฮฑ< 1$ Hรถlder smooth distributions $P,Q$ on $(0,1)^{d}$; an optimal $ฮ˜(ฮต^{-2})$ for $ฮฑ> 1/2$ when $d=2$ and nearly optimal as $ฮฑ\to 1$ when $d = 3$.


High-stakes courtroom drama of Musk v OpenAI hears closing arguments

The Guardian

OpenAI's CEO, Sam Altman, arrives at the federal courthouse in Oakland, California, on Thursday. OpenAI's CEO, Sam Altman, arrives at the federal courthouse in Oakland, California, on Thursday. Nine-person jury to consider whether AI firm bilked world's richest person and unjustly enriched themselves Closing arguments began on Thursday in Elon Musk's lawsuit against Sam Altman and OpenAI, bringing the weeks-long courtroom battle between the two tech moguls nearer to a decision. A nine-person jury is set to deliberate and return a verdict on whether they believe the AI firm and Altman are liable in the case. The trial, which began last month in an Oakland, California, federal courthouse, has gripped Silicon Valley and featured some of the tech industry's biggest names as witnesses.


Kernel-based guarantees for nonlinear parametric models in Bayesian optimization

arXiv.org Machine Learning

Modern Bayesian optimization and adaptive sampling methods increasingly rely on nonlinear parametric models, yet theoretical guarantees for such models under adaptive data collection remain limited. Existing analyses largely focus on Gaussian processes, kernel machines, linear models, or linearized neural approximations, leaving a gap between theory and the nonlinear models used in practice. We develop a kernel-based framework for analyzing regularized nonlinear parametric models trained on adaptively collected data. Our approach uses kernels over the parameter space to induce reproducing-kernel Hilbert space structures over the corresponding model class, yielding confidence bounds for models trained with broad classes of regularized convex losses. We show how these bounds can support convergence guarantees for nonlinear acquisition and surrogate models, including randomized regularized policies that select points by maximizing a trained random model. These results provide a unified route to analyzing nonlinear parametric models in Bayesian optimization and related adaptive optimization settings.


Uniform Scaling Limits in AdamW-Trained Transformers

arXiv.org Machine Learning

We study the large-depth limit of transformers trained with AdamW, by modelling the hidden-state dynamics as an interacting particle system (IPS) coupled through the attention mechanism. Under appropriate scaling of the attention heads, we prove that the joint dynamics of the hidden states and backpropagated variables converge in $L^2$, uniformly over the initial condition, to the solution of a forward--backward system of ODEs at rate $\mathcal O(L^{-1}+L^{-1/3}H^{-1/2})$. Here, $L$ and $H$ denote the depth and number of heads of the transformer, respectively. The limiting system of ODEs can be identified with a McKean--Vlasov ODE (MVODE) when the attention heads do not incorporate causal masking. By using the flow maps associated with this MVODE and applying concentration of measure techniques, we obtain bounds on the difference between the discrete and continuous models that are uniform over compact sets of initial conditions. As this is achieved without resorting to a covering argument, the constants in our bounds are independent of the number of tokens. Furthermore, under a suitable adaptation to AdamW, the bounds become independent of the token embedding dimension.