isoperimetry
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Rhode Island > Providence County > Providence (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Asia > Middle East > Jordan (0.05)
- Europe > Switzerland > Basel-City > Basel (0.04)
- Europe > France > Grand Est > Bas-Rhin > Strasbourg (0.04)
- (2 more...)
A Universal Law of Robustness via Isoperimetry
Classically, data interpolation with a parametrized model class is possible as long as the number of parameters is larger than the number of equations to be satisfied. A puzzling phenomenon in the current practice of deep learning is that models are trained with many more parameters than what this classical theory would suggest. We propose a theoretical explanation for this phenomenon. We prove that for a broad class of data distributions and model classes, overparametrization is {\em necessary} if one wants to interpolate the data {\em smoothly}. Namely we show that {\em smooth} interpolation requires $d$ times more parameters than mere interpolation, where $d$ is the ambient data dimension. We prove this universal law of robustness for any smoothly parametrized function class with polynomial size weights, and any covariate distribution verifying isoperimetry. In the case of two-layers neural networks and Gaussian covariates, this law was conjectured in prior work by Bubeck, Li and Nagaraj. We also give an interpretation of our result as an improved generalization bound for model classes consisting of smooth functions.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Rhode Island > Providence County > Providence (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
A Universal Law of Robustness via Isoperimetry
Classically, data interpolation with a parametrized model class is possible as long as the number of parameters is larger than the number of equations to be satisfied. A puzzling phenomenon in the current practice of deep learning is that models are trained with many more parameters than what this classical theory would suggest. We propose a theoretical explanation for this phenomenon. We prove that for a broad class of data distributions and model classes, overparametrization is {\em necessary} if one wants to interpolate the data {\em smoothly}. Namely we show that {\em smooth} interpolation requires d times more parameters than mere interpolation, where d is the ambient data dimension.
Parallel simulation for sampling under isoperimetry and score-based diffusion models
Zhou, Huanjian, Sugiyama, Masashi
In recent years, there has been a surge of interest in proving discretization bounds for sampling under isoperimetry and for diffusion models. As data size grows, reducing the iteration cost becomes an important goal. Inspired by the great success of the parallel simulation of the initial value problem in scientific computation, we propose parallel Picard methods for sampling tasks. Rigorous theoretical analysis reveals that our algorithm achieves better dependence on dimension $d$ than prior works in iteration complexity (i.e., reduced from $\widetilde{O}(\log^2 d)$ to $\widetilde{O}(\log d)$), which is even optimal for sampling under isoperimetry with specific iteration complexity. Our work highlights the potential advantages of simulation methods in scientific computation for dynamics-based sampling and diffusion models.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Europe > France > Occitanie > Haute-Garonne > Toulouse (0.04)
Spectral Clustering on Large Datasets: When Does it Work? Theory from Continuous Clustering and Density Cheeger-Buser
Chu, Timothy, Miller, Gary, Walkington, Noel
Spectral clustering is one of the most popular clustering algorithms that has stood the test of time. It is simple to describe, can be implemented using standard linear algebra, and often finds better clusters than traditional clustering algorithms like $k$-means and $k$-centers. The foundational algorithm for two-way spectral clustering, by Shi and Malik, creates a geometric graph from data and finds a spectral cut of the graph. In modern machine learning, many data sets are modeled as a large number of points drawn from a probability density function. Little is known about when spectral clustering works in this setting -- and when it doesn't. Past researchers justified spectral clustering by appealing to the graph Cheeger inequality (which states that the spectral cut of a graph approximates the ``Normalized Cut''), but this justification is known to break down on large data sets. We provide theoretically-informed intuition about spectral clustering on large data sets drawn from probability densities, by proving when a continuous form of spectral clustering considered by past researchers (the unweighted spectral cut of a probability density) finds good clusters of the underlying density itself. Our work suggests that Shi-Malik spectral clustering works well on data drawn from mixtures of Laplace distributions, and works poorly on data drawn from certain other densities, such as a density we call the `square-root trough'. Our core theorem proves that weighted spectral cuts have low weighted isoperimetry for all probability densities. Our key tool is a new Cheeger-Buser inequality for all probability densities, including discontinuous ones.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- (12 more...)
Federated Learning with a Sampling Algorithm under Isoperimetry
Federated learning uses a set of techniques to efficiently distribute the training of a machine learning algorithm across several devices, who own the training data. These techniques critically rely on reducing the communication cost – the main bottleneck – between the devices and a central server. Federated learning algorithms usually take an optimization approach: they are algorithms for minimizing the training loss subject to communication (and other) constraints. In this work, we instead take a Bayesian approach for the training task, and propose a communication-efficient variant of the Langevin algorithm to sample a posteriori. The latter approach is more robust and provides more knowledge of the a posteriori distribution than its optimization counterpart.
A Universal Law of Robustness via Isoperimetry
Bubeck, Sébastien, Sellke, Mark
We propose an explanation for this enigmatic phenomenon, showing in great generality that finding a smooth function to fit d-dimensional data requires at least nd parameters. In other words, overparametrization by a factor of d is necessary for smooth interpolation, suggesting that perhaps the large size of the models used in deep learning is a necessity rather than a weakness of the framework. Another way to phrase the result is as a tradeoff between the size of a model (as measured by the number of parameters) and its "robustness" (as measured by its Lipschitz constant): either one has a small model (with n parameters) which must then be non-robust, or one has a robust model (constant Lipschitz) but then it must be very large (with nd parameters). Such a tradeoff was conjectured for the specific case of two-layer neural networks and Gaussian data in [BLN21]. Our result shows that in fact it is a universal phenomenon, which applies to essentially any parametrized function class (including in particular deep neural networks) as well as a much broader class of data distributions.