

 Mathematical & Statistical Methods



I Background in Linear Algebra

Neural Information Processing Systems

In this section we state some elementary results that we will use in our main proofs. The next lemma is part of the proof of [44, Lemma 4.2]; we state it here as a separate result to shorten the longer proofs that follow. In this section we specialize the definitions to the case of Gaussian matrices. Lemma 7. Let $n \geq 1$ be an integer, and $\delta \in (0, 1/2)$.



Kernel methods through the roof: handling billions of points efficiently

Neural Information Processing Systems

It is not a surprise that kernel methods are among the most theoretically studied models. From a numerical point of view, they reduce to convex optimization problems that can be solved with strong guarantees.
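As a worked illustration of the reduction to convex optimization, consider kernel ridge regression. This example is not taken from the abstract above; the symbols $K$ (the $n \times n$ kernel matrix), $y$ (the targets), $\alpha$ (the coefficients), and $\lambda > 0$ (the regularization parameter) are notation assumed here.

$$
\min_{\alpha \in \mathbb{R}^n} \; \frac{1}{n} \lVert K\alpha - y \rVert^2 + \lambda\, \alpha^\top K \alpha
$$

Because $K$ is positive semidefinite, the objective is convex in $\alpha$. Setting the gradient to zero gives $K\big[(K + n\lambda I)\alpha - y\big] = 0$, so one minimizer is

$$
\alpha = (K + n\lambda I)^{-1} y .
$$

Solving this linear system directly costs $O(n^3)$ time and $O(n^2)$ memory, which is the bottleneck that large-scale kernel solvers aim to avoid.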



Convergence rates of sub-sampled Newton methods

Neural Information Processing Systems

We consider the problem of minimizing a sum of $n$ functions via projected iterations onto a convex parameter set $\mathcal{C} \subset \mathbb{R}^p$, where $n\gg p\gg 1$. In this regime, algorithms which utilize sub-sampling techniques are known to be effective. In this paper, we use sub-sampling techniques together with low-rank approximation to design a new randomized batch algorithm which possesses comparable convergence rate to Newton's method, yet has much smaller per-iteration cost. The proposed algorithm is robust in terms of starting point and step size, and enjoys a composite convergence rate, namely, quadratic convergence at start and linear convergence when the iterate is close to the minimizer. We develop its theoretical analysis which also allows us to select near-optimal algorithm parameters. Our theoretical results can be used to obtain convergence rates of previously proposed sub-sampling based algorithms as well. We demonstrate how our results apply to well-known machine learning problems. Lastly, we evaluate the performance of our algorithm on several datasets under various scenarios.
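The core idea of Hessian sub-sampling can be sketched on a least-squares problem. This is a minimal illustration under stated assumptions, not the paper's algorithm: it omits the low-rank approximation and the projection onto $\mathcal{C}$, uses a unit step size, and all function names (`solve`, `gradient`, `subsampled_hessian`, `subsampled_newton`) are hypothetical. On a quadratic objective the iteration converges only linearly, so the composite rate described above is not visible here.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def gradient(A, b, x):
    """Full gradient of f(x) = (1 / 2n) * sum_i (a_i . x - b_i)^2."""
    n, p = len(A), len(x)
    g = [0.0] * p
    for a_i, b_i in zip(A, b):
        residual = sum(a * xj for a, xj in zip(a_i, x)) - b_i
        for j in range(p):
            g[j] += residual * a_i[j] / n
    return g

def subsampled_hessian(A, idx):
    """Hessian estimate (1 / |S|) * sum_{i in S} a_i a_i^T from rows idx."""
    p = len(A[0])
    H = [[0.0] * p for _ in range(p)]
    for i in idx:
        a = A[i]
        for j in range(p):
            for k in range(p):
                H[j][k] += a[j] * a[k] / len(idx)
    return H

def subsampled_newton(A, b, sample_size, iters, rng):
    """Newton-type iteration: full gradient, freshly sub-sampled Hessian."""
    x = [0.0] * len(A[0])
    for _ in range(iters):
        idx = rng.sample(range(len(A)), sample_size)
        step = solve(subsampled_hessian(A, idx), gradient(A, b, x))
        x = [xj - sj for xj, sj in zip(x, step)]
    return x

# Synthetic data (assumed for illustration): n = 2000 rows, p = 3 features.
rng = random.Random(0)
n, p = 2000, 3
A = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
x_true = [1.0, -2.0, 0.5]
b = [sum(a * t for a, t in zip(row, x_true)) + 0.1 * rng.gauss(0, 1) for row in A]

x_hat = subsampled_newton(A, b, sample_size=200, iters=30, rng=rng)
grad_norm = sum(g * g for g in gradient(A, b, x_hat)) ** 0.5
```

Each iteration forms the Hessian estimate from 200 of the 2000 rows, so the Hessian cost drops by roughly a factor of ten while the gradient is still computed exactly.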


A variational approach to dimension-free self-normalized concentration

arXiv.org Machine Learning

We study the self-normalized concentration of vector-valued stochastic processes. We focus on bounds for sub-$ψ$ processes, a tail condition that encompasses a wide variety of well-known distributions (including sub-exponential, sub-Gaussian, sub-gamma, and sub-Poisson distributions). Our results recover and generalize the influential bound of Abbasi-Yadkori et al. (2011) and fill a gap in the literature between determinant-based bounds and those based on condition numbers. As applications we prove a Bernstein inequality for random vectors satisfying a moment condition (which is more general than boundedness), and also provide the first dimension-free, self-normalized empirical Bernstein inequality. Our techniques are based on the variational (PAC-Bayes) approach to concentration.


Can SGD Handle Heavy-Tailed Noise?

arXiv.org Artificial Intelligence

Stochastic Gradient Descent (SGD) is a cornerstone of large-scale optimization, yet its theoretical behavior under heavy-tailed noise -- common in modern machine learning and reinforcement learning -- remains poorly understood. In this work, we rigorously investigate whether vanilla SGD, devoid of any adaptive modifications, can provably succeed under such adverse stochastic conditions. Assuming only that stochastic gradients have bounded $p$-th moments for some $p \in (1, 2]$, we establish sharp convergence guarantees for (projected) SGD across convex, strongly convex, and non-convex problem classes. In particular, we show that SGD achieves minimax optimal sample complexity under minimal assumptions in the convex and strongly convex regimes: $\mathcal{O}(\varepsilon^{-\frac{p}{p-1}})$ and $\mathcal{O}(\varepsilon^{-\frac{p}{2(p-1)}})$, respectively. For non-convex objectives under Hölder smoothness, we prove convergence to a stationary point with rate $\mathcal{O}(\varepsilon^{-\frac{2p}{p-1}})$, and complement this with a matching lower bound specific to SGD with arbitrary polynomial step-size schedules. Finally, we consider non-convex Mini-batch SGD under standard smoothness and bounded central moment assumptions, and show that it also achieves a comparable $\mathcal{O}(\varepsilon^{-\frac{2p}{p-1}})$ sample complexity with a potential improvement in the smoothness constant. These results challenge the prevailing view that heavy-tailed noise renders SGD ineffective, and establish vanilla SGD as a robust and theoretically principled baseline -- even in regimes where the variance is unbounded.
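The setting can be made concrete with a small sketch of projected SGD on a one-dimensional strongly convex objective, $f(x) = x^2/2$ on $[-R, R]$, with gradient noise that has infinite variance. Everything here is an assumption chosen for illustration and not taken from the paper: the symmetric Lomax noise with shape `alpha = 1.5` (finite $p$-th moments only for $p < 1.5$), the $1/t$ step size, and the function names.

```python
import random

def project(x, radius):
    """Euclidean projection of a scalar onto the interval [-radius, radius]."""
    return max(-radius, min(radius, x))

def heavy_tailed_noise(rng, alpha):
    """Zero-mean symmetric noise with P(|xi| > m) = (1 + m) ** (-alpha).

    For 1 < alpha < 2 the variance is infinite, while the p-th absolute
    moment is finite for every p < alpha.
    """
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids division by zero
    magnitude = u ** (-1.0 / alpha) - 1.0
    return magnitude if rng.random() < 0.5 else -magnitude

def projected_sgd(steps, radius, alpha, seed):
    """Vanilla projected SGD on f(x) = x^2 / 2 with step size 1 / t."""
    rng = random.Random(seed)
    x = radius  # start at the boundary of the feasible interval
    for t in range(1, steps + 1):
        stochastic_grad = x + heavy_tailed_noise(rng, alpha)
        x = project(x - stochastic_grad / t, radius)
    return x

x_final = projected_sgd(steps=10000, radius=5.0, alpha=1.5, seed=0)
```

The projection keeps every iterate inside the feasible interval even when a single noise draw is very large, which is one reason the projected variant is convenient to analyze without gradient clipping or other adaptive modifications.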


Mathematical Foundations of Geometric Deep Learning

arXiv.org Artificial Intelligence

Since the dawn of civilization, humans have tried to understand the nature of intelligence. With the advent of computers, there have been attempts to emulate human intelligence using computer algorithms - a field that was dubbed 'Artificial Intelligence' or 'AI' by the computer scientist John McCarthy in 1956 and has recently enjoyed an explosion of popularity. Many efforts in AI research have focused on the study and replication of what is considered the hallmark of human cognition, such as playing intelligent games, the faculty of language, visual perception, and creativity. While at the time of writing we have multiple successful takes at the above - computers nowadays play chess and Go better than any human, can translate English into Chinese without a dictionary, automatically drive a car in a crowded city, and generate poetry and art that wins artistic competitions - it is fair to say that we still do not have a full understanding of what human-like or 'general' intelligence entails and how to replicate it.


Fast Gaussian process inference by exact Matérn kernel decomposition

arXiv.org Machine Learning

To speed up Gaussian process inference, a number of fast kernel matrix-vector multiplication (MVM) approximation algorithms have been proposed over the years. In this paper, we establish an exact fast kernel MVM algorithm based on exact kernel decomposition into weighted empirical cumulative distribution functions, compatible with a class of kernels which includes multivariate Matérn kernels with half-integer smoothness parameter. This algorithm uses a divide-and-conquer approach, during which sorting outputs are stored in a data structure. We also propose a new algorithm to take into account some linear fixed effects predictor function. Our numerical experiments confirm that our algorithm is very effective for low-dimensional Gaussian process inference problems with hundreds of thousands of data points. An implementation of our algorithm is available at https://gitlab.com/warin/fastgaussiankernelregression.git.
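To illustrate why sorting can make a kernel MVM exact and fast, here is a sketch for the simplest case: the one-dimensional Matérn kernel with smoothness $1/2$, $k(x, y) = \exp(-|x - y| / \ell)$. This is a classical running-sum construction and not the paper's divide-and-conquer algorithm or its weighted empirical CDF decomposition; the function names and test data are assumptions made for illustration.

```python
import math
import random

def exp_kernel_mvm_fast(x, v, length_scale):
    """Exact K @ v for k(x, y) = exp(-|x - y| / length_scale) in O(n log n).

    After sorting, the sum over points to the left (and to the right) of
    x_i satisfies a one-step recursion, so two linear sweeps suffice.
    """
    n = len(x)
    if n == 0:
        return []
    order = sorted(range(n), key=lambda i: x[i])
    xs = [x[i] for i in order]
    vs = [v[i] for i in order]
    left = [0.0] * n   # left[i]  = sum_{j <= i} exp(-(xs[i] - xs[j]) / l) * vs[j]
    right = [0.0] * n  # right[i] = sum_{j > i}  exp(-(xs[j] - xs[i]) / l) * vs[j]
    left[0] = vs[0]
    for i in range(1, n):
        decay = math.exp(-(xs[i] - xs[i - 1]) / length_scale)
        left[i] = vs[i] + decay * left[i - 1]
    for i in range(n - 2, -1, -1):
        decay = math.exp(-(xs[i + 1] - xs[i]) / length_scale)
        right[i] = decay * (vs[i + 1] + right[i + 1])
    out = [0.0] * n
    for rank, i in enumerate(order):  # undo the sort
        out[i] = left[rank] + right[rank]
    return out

def exp_kernel_mvm_naive(x, v, length_scale):
    """Reference O(n^2) product used only to check the fast version."""
    return [
        sum(math.exp(-abs(xi - xj) / length_scale) * vj for xj, vj in zip(x, v))
        for xi in x
    ]

rng = random.Random(1)
x = [rng.uniform(0.0, 10.0) for _ in range(200)]
v = [rng.uniform(-1.0, 1.0) for _ in range(200)]
fast = exp_kernel_mvm_fast(x, v, length_scale=2.0)
naive = exp_kernel_mvm_naive(x, v, length_scale=2.0)
max_abs_diff = max(abs(a - b) for a, b in zip(fast, naive))
```

Every decay factor lies in $(0, 1]$, so the recursions do not overflow. Higher half-integer smoothness multiplies the exponential by a polynomial in $|x - y|$, which requires tracking several running sums instead of one.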