Uniform-in-Time Weak Propagation-of-Chaos in Shallow Neural Networks

arXiv.org Machine Learning

We consider one-hidden-layer neural networks trained in the feature-learning regime using gradient descent, and relate the output of the finite-width network $f_{\hat{\rho}_t^m}$ to its infinite-width counterpart $f_{\rho_t^{MF}}$, which evolves in the mean-field dynamics. While constant-time horizon bounds for $\|f_{\rho_t^{MF}} - f_{\hat{\rho}_t^m}\|$ may be obtained via standard Grönwall estimates, the long-time behavior of the fluctuation is a more delicate matter. Uniform-in-time bounds often rely on (local) strong convexity in the landscape or logarithmic Sobolev inequalities present in noisy gradient dynamics. In this work, we establish non-asymptotic weak propagation-of-chaos that holds uniformly in time, obtained by exploiting instead the convergence rate of the mean-field deterministic Wasserstein-gradient-flow dynamics. Specifically, denoting by $L_t$ the mean-field excess MSE loss at time $t$ and $m$ the number of neurons, under standard regularity assumptions and the condition $\int_0^\infty L_t^{1/2} dt = O(\log d)$, we obtain the uniform-in-time bound $\|f_{\rho_t^{MF}} - f_{\hat{\rho}_t^m}\|^2 \lesssim \text{poly}(d)\, m^{-\min(1,c/6)}$ whenever $L_t \lesssim t^{-c}$. Our result holds in a noiseless setting and does not make any assumptions on the geometry of the landscape near the optimum, and extends seamlessly to other forms of discretization, including a finite number of samples and time discretization. A key takeaway of our result is that whenever the convergence rate of the mean-field, population-loss dynamics is faster than $t^{-2}$, we can attain a loss of $\varepsilon$ with only $\text{poly}(d/\varepsilon)$ neurons, training samples, and GD steps.
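One way to read the $t^{-2}$ threshold and the neuron count, offered as an informal sketch and not as the paper's argument: assume $L_t \lesssim t^{-c}$ for $t \ge 1$ and that $L_t$ is bounded on $[0,1]$. The tail of the integrability condition and the width needed for accuracy $\varepsilon$ then work out as follows. The dependence on $d$ in the $O(\log d)$ condition is not captured here.

```latex
% Tail of the integral condition: finite exactly when c > 2.
\int_1^\infty L_t^{1/2}\,dt \;\lesssim\; \int_1^\infty t^{-c/2}\,dt
  \;=\; \frac{1}{c/2 - 1} \;<\; \infty
  \quad\Longleftrightarrow\quad c > 2 .

% Width needed so that the stated bound is at most \varepsilon.
\text{poly}(d)\, m^{-\min(1,\,c/6)} \;\le\; \varepsilon
  \quad\Longleftrightarrow\quad
  m \;\ge\; \left(\frac{\text{poly}(d)}{\varepsilon}\right)^{\max(1,\,6/c)} .
```

For fixed $c > 2$ the exponent $\max(1, 6/c)$ is a constant smaller than $3$, so the required width stays polynomial in $d/\varepsilon$.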


On efficient robust regression with subquadratic samples

arXiv.org Machine Learning

We revisit the problem of robust linear regression under Gaussian covariates with an unknown covariance matrix of condition number $\kappa$. For this fundamental problem, significant gaps remain in our understanding of the trade-offs among sample complexity, condition number, runtime, and prediction error for efficient algorithms. Our first result is a near-linear-time algorithm that uses $\widetilde{O}(d/\varepsilon^4)$ samples, where $d$ is the dimension and $\varepsilon$ is the corruption rate, and achieves prediction error $O(\sqrt{\varepsilon\kappa})$ under the condition $\varepsilon\kappa \lesssim 1$, improving over all prior works. We complement this result with a Statistical Query (SQ) lower bound showing that efficient SQ algorithms achieving error $o(\sqrt{\varepsilon\kappa})$ when $\varepsilon\kappa \lesssim 1$ require queries that take $\Omega(d^2)$ samples to simulate. Finally, we prove a low-degree polynomial lower bound that gives fine-grained evidence that, without assumptions such as $\varepsilon\kappa \lesssim 1$, efficient algorithms may require $\tilde{\Omega}\left(\min\{d\varepsilon^{2}\kappa^{2},\ \varepsilon^{2}d^{2}\}\right)$ samples to significantly outperform the trivial estimator that always guesses $0$.


Supplementary Material for: An Exponential Lower Bound for Linearly-Realizable MDPs with Constant Suboptimality Gap

Neural Information Processing Systems

We first verify the statement for the terminal state $f$. Observe that at the terminal state $f$, regardless of the action taken, the next state is always $f$ and the reward is always 0. Hence $Q_h(f,\cdot) = V_h(f) = 0$ for all $h \in [H]$. Thus $Q_h(f,\cdot) = \langle \phi(f,\cdot), v(a) \rangle = 0$. We now verify realizability for other states via induction on $h = H, H-1, \dots, 1$. Next, note that for all $h$, (2) follows from (1). In other words, (1) implies that $a$ is always the optimal action.



1c71cd4032da425409d8ada8727bad42-Supplemental-Conference.pdf

Neural Information Processing Systems

We can see that the error for the first term is mainly due to the sample approximation. We therefore refer to the first term as the Variance. We refer to the second term as the Bias. Our proof of convergence of the bias adapts the proof in [31, Theorem 6] and [11], and utilizes the fact that $C_{Y|X}$ is Hilbert-Schmidt to obtain a sharp rate.

A.1 Bounding the Bias

In this section, we establish the bound on the bias.


where the last inequality follows from the fact that $U_{ij} \le 1$. Also, for any $i \in [n]$ and $j \in [k]$, we have $x_i$ $\hat{\mu}_j$

Neural Information Processing Systems

To prove Lemma 2 we start by proving a few inequalities. Since $A$ is an ( 1, 2, Q)-solver, using Definition 4 and Taylor's expansion, we get for any $i \in [n]$ and $j \in [k]$,

In this section we present and prove a few auxiliary results which will be used in the proofs of our main results. We start with the following standard concentration inequalities. R2, (32) if n c log(1/δ) 2, where $c > 0$ is some absolute constant.

The following locality lemma states that the fuzzy k-means function is strictly increasing. Lemma 5. Let $(X, P^\star)$ be a clustering instance, where $P^\star$ refers to the optimal solution for the fuzzy k-means problem (namely, minimizes the objective in (2)).

Output: $\hat{\mu}_j$
1: Initialize $S \leftarrow \emptyset$.
2: for $s = 1, 2, \dots, m$ do
3: Sample $i$ uniformly at random from $[n]$ and update $S \leftarrow S \cup \{i\}$.

Next, we analyze the performance of Algorithm 6, which estimates the center of a given cluster using a set of randomly sampled elements. Note that this algorithm is used as a sub-routine in Algorithm 1. Lemma 6 (Estimate of mean using uniform sampling). Let $(X, P)$ be a consistent center-based clustering instance, and let $\delta \in (0,1)$.
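The sampling loop quoted above can be sketched in Python. Only the loop itself (draw $m$ indices uniformly from $[n]$ and collect them in a set $S$) comes from the excerpt. The final averaging step, the `memberships` input, the fuzzifier `alpha`, and the function name `estimate_center` are assumptions made for illustration, since the excerpt does not show how the estimate is formed from $S$.

```python
import random


def estimate_center(points, memberships, m, alpha=2.0, seed=None):
    """Hypothetical sketch: estimate one cluster center from uniform samples.

    points      -- list of n points, each a list of floats
    memberships -- list of n membership values in [0, 1] for this cluster
                   (assumed to be available, e.g. from queries)
    m           -- number of uniform draws
    alpha       -- fuzzifier exponent (assumption, not given in the excerpt)
    """
    rng = random.Random(seed)
    n = len(points)

    # Steps 1-3 of the excerpt: S starts empty, then m uniform draws from [n].
    # S is a set, so repeated draws of the same index collapse.
    sampled = set()
    for _ in range(m):
        sampled.add(rng.randrange(n))

    # Assumed estimator: membership-weighted mean over the sampled points.
    weights = {i: memberships[i] ** alpha for i in sampled}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("all sampled memberships are zero")

    dim = len(points[0])
    return [
        sum(weights[i] * points[i][d] for i in sampled) / total
        for d in range(dim)
    ]
```

With identical points the estimate equals that point regardless of which indices are drawn, which gives a quick sanity check of the weighting step.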


Denoising distances beyond the volumetric barrier

arXiv.org Machine Learning

We study the problem of reconstructing the latent geometry of a $d$-dimensional Riemannian manifold from a random geometric graph. While recent works have made significant progress in manifold recovery from random geometric graphs, and more generally from noisy distances, the precision of pairwise distance estimation has been fundamentally constrained by the volumetric barrier, namely the natural sample-spacing scale $n^{-1/d}$ coming from the fact that a generic point of the manifold typically lies at distance of order $n^{-1/d}$ from the nearest sampled point. In this paper, we introduce a novel approach, Orthogonal Ring Distance Estimation Routine (ORDER), which achieves a pointwise distance estimation precision of order $n^{-2/(d+5)}$ up to polylogarithmic factors in $n$ in polynomial time. This strictly beats the volumetric barrier for dimensions $d > 5$. As a consequence of obtaining pointwise precision better than $n^{-1/d}$, we prove that the Gromov--Wasserstein distance between the reconstructed metric measure space and the true latent manifold is of order $n^{-1/d}$. This matches the Wasserstein convergence rate of empirical measures, demonstrating that our reconstructed graph metric is asymptotically as good as having access to the full pairwise distance matrix of the sampled points. Our results are proven in a very general setting which includes general models of noisy pairwise distances, sparse random geometric graphs, and unknown connection probability functions.
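The claim that the precision $n^{-2/(d+5)}$ beats the volumetric scale $n^{-1/d}$ exactly when $d > 5$ can be checked directly by comparing exponents (ignoring the polylogarithmic factors, and taking $n > 1$):

```latex
n^{-2/(d+5)} < n^{-1/d}
  \;\Longleftrightarrow\; \frac{2}{d+5} > \frac{1}{d}
  \;\Longleftrightarrow\; 2d > d + 5
  \;\Longleftrightarrow\; d > 5 .

% Boundary cases:
%   d = 5:  2/10 = 1/5, so the two rates coincide.
%   d = 6:  2/11 \approx 0.182 > 1/6 \approx 0.167, so the new rate is strictly better.
```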