Learning in High Dimensional Spaces


Large-scale optimal transport map estimation using projection pursuit

Neural Information Processing Systems

This paper studies the estimation of large-scale optimal transport maps (OTM), a well-known challenging problem owing to the curse of dimensionality. Existing literature approximates the large-scale OTM by a series of one-dimensional OTM problems through iterative random projection. Such methods, however, suffer from slow or no convergence in practice because the projection directions are selected at random. Instead, we propose a method for estimating large-scale OTMs that combines the ideas of projection pursuit regression and sufficient dimension reduction. The proposed method, named projection pursuit Monge map (PPMM), adaptively selects the most "informative" projection direction in each iteration. We show theoretically that the proposed dimension reduction method consistently estimates the most "informative" projection direction in each iteration.
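For concreteness, here is a minimal sketch (my own illustration, not the authors' code) of the random-projection baseline the abstract refers to: in each iteration a direction is drawn uniformly at random, the one-dimensional OTM along that direction is computed by quantile (sorted-order) matching, and the source sample is pushed forward accordingly. PPMM replaces the random draw with a direction chosen adaptively via sufficient dimension reduction.

```python
import numpy as np

def one_d_ot_update(X, Y, theta):
    """Push X one step toward Y along direction theta using the 1D OTM.

    Assumes X and Y contain the same number of samples, so the 1D map
    reduces to matching sorted (quantile) order along the projection.
    """
    x_proj = X @ theta
    y_proj = Y @ theta
    x_order = np.argsort(x_proj)
    y_sorted = np.sort(y_proj)
    displacement = np.empty_like(x_proj)
    displacement[x_order] = y_sorted - x_proj[x_order]
    return X + displacement[:, None] * theta[None, :]

def random_projection_otm(X, Y, n_iter=500, seed=0):
    """Iterative random-projection approximation of the OTM (the baseline,
    not PPMM: PPMM would choose theta adaptively rather than at random)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    for _ in range(n_iter):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)
        X = one_d_ot_update(X, Y, theta)
    return X
```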


Quantifying the Empirical Wasserstein Distance to a Set of Measures: Beating the Curse of Dimensionality

Neural Information Processing Systems

We consider the problem of estimating the Wasserstein distance between the empirical measure and a set of probability measures whose expectations over a class of functions (hypothesis class) are constrained. If this class is sufficiently rich to characterize a particular distribution (e.g., all Lipschitz functions), then our formulation recovers the Wasserstein distance to such a distribution. We establish a strong duality result that generalizes the celebrated Kantorovich-Rubinstein duality. We also show that our formulation can be used to beat the curse of dimensionality, which is well known to affect the rates of statistical convergence of the empirical Wasserstein distance. In particular, examples of infinite-dimensional hypothesis classes are presented, informed by a complex correlation structure, for which it is shown that the empirical Wasserstein distance to such classes converges to zero at the standard parametric rate. Our formulation provides insights that help clarify why, despite the curse of dimensionality, the Wasserstein distance enjoys favorable empirical performance across a wide range of statistical applications.
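For reference, the classical Kantorovich-Rubinstein duality that this result generalizes (standard statement, quoted here for context rather than taken from the paper) expresses the 1-Wasserstein distance as a supremum over 1-Lipschitz test functions:

$$
W_1(\mu, \nu) \;=\; \sup_{\|f\|_{\mathrm{Lip}} \le 1} \left\{ \int f \, \mathrm{d}\mu - \int f \, \mathrm{d}\nu \right\}.
$$

The paper's duality replaces the single target measure by a set of measures constrained through expectations over a hypothesis class.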


979a3f14bae523dc5101c52120c535e9-AuthorFeedback.pdf

Neural Information Processing Systems

We thank the reviewers for the helpful feedback and the positive assessment of our submission.

Reviewer #1: "It is interesting to see if further increase the width of the network (from linear in d to polynomial in d [...]" In the setting of our paper (minimization of the total network size) a large depth is in some sense unavoidable (as e.g. [...]). However, in general there is of course some trade-off between width and depth. Assuming a sufficiently constrained family (e.g. a ball in the Barron space [...]).

Reviewer #4: "Theorem 5.1 extends the approximation results to all piece-wise linear activation functions and not just [...]. So in theory, this should also apply to max-outs and other variants of ReLUs such as Leaky ReLUs?" That's right: all these functions are easily expressible one via another using just linear operations (ReLU(x) = [...]).

Reviewer #4: "I fail to see some intuitions regarding the typical values of r, d, and H for the networks used in practice. [...]"

T. Poggio et al., Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review.
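To illustrate the inter-expressibility point numerically (my own sanity check, not part of the rebuttal), the following identities relate ReLU and Leaky ReLU using only linear operations:

```python
import numpy as np

# Illustration (not from the rebuttal): piecewise-linear activations are
# inter-expressible using only linear operations, e.g.
#   LeakyReLU_a(x) = ReLU(x) - a * ReLU(-x)
#   ReLU(x)        = (LeakyReLU_a(x) - a * x) / (1 - a)
a = 0.1
x = np.linspace(-3.0, 3.0, 1001)
relu = np.maximum(x, 0.0)
leaky = np.where(x >= 0, x, a * x)
assert np.allclose(leaky, relu - a * np.maximum(-x, 0.0))
assert np.allclose(relu, (leaky - a * x) / (1 - a))
```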


Sufficient dimension reduction for classification using principal optimal transport direction

Neural Information Processing Systems

Sufficient dimension reduction is used pervasively as a supervised dimension reduction approach. Most existing sufficient dimension reduction methods are developed for data with a continuous response and may perform unsatisfactorily when the response is categorical, especially binary. To address this issue, we propose a novel method for estimating the sufficient dimension reduction subspace (SDR subspace) using optimal transport. The proposed method, named principal optimal transport direction (POTD), estimates the basis of the SDR subspace using the principal directions of the optimal transport coupling between the data from different response categories. The proposed method also reveals the relationship among three seemingly unrelated topics: sufficient dimension reduction, support vector machines, and optimal transport. We study the asymptotic properties of POTD and show that, when the class labels contain no error, POTD estimates the SDR subspace exclusively. Empirical studies show that POTD outperforms most state-of-the-art linear dimension reduction methods.
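One plausible reading of the construction, sketched below (an illustration assuming the POT package for the coupling, not the authors' estimator): compute the optimal coupling between the two class samples and take the leading eigenvectors of the coupling-weighted displacement second-moment matrix as the estimated directions.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def potd_directions(X0, X1, n_dirs=2):
    """Illustrative sketch of the POTD idea for a binary response.

    X0, X1: samples from the two response categories, shape (n0, d) and (n1, d).
    Returns n_dirs candidate basis vectors of the SDR subspace.
    """
    n0, n1 = len(X0), len(X1)
    a = np.full(n0, 1.0 / n0)                 # uniform weights on class 0
    b = np.full(n1, 1.0 / n1)                 # uniform weights on class 1
    M = ot.dist(X0, X1)                       # squared Euclidean cost matrix
    G = ot.emd(a, b, M)                       # optimal coupling, shape (n0, n1)
    diffs = X0[:, None, :] - X1[None, :, :]   # pairwise displacements, (n0, n1, d)
    S = np.einsum('ij,ijk,ijl->kl', G, diffs, diffs)  # coupling-weighted 2nd moment
    eigval, eigvec = np.linalg.eigh(S)        # eigenvalues in ascending order
    return eigvec[:, ::-1][:, :n_dirs]        # leading principal directions
```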


Uniform Convergence of Interpolators: Gaussian Width, Norm Bounds and Benign Overfitting

Neural Information Processing Systems

We consider interpolation learning in high-dimensional linear regression with Gaussian data, and prove a generic uniform convergence guarantee on the generalization error of interpolators in an arbitrary hypothesis class in terms of the class's Gaussian width. Applying the generic bound to Euclidean norm balls recovers the consistency result of Bartlett et al. (2020) for minimum-norm interpolators, and confirms a prediction of Zhou et al. (2020) for near-minimal-norm interpolators in the special case of Gaussian data.
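For reference, the Gaussian width of a hypothesis class $\mathcal{K} \subset \mathbb{R}^d$ appearing in such bounds is the standard quantity

$$
w(\mathcal{K}) \;=\; \mathbb{E}_{g \sim \mathcal{N}(0, I_d)} \Big[ \sup_{v \in \mathcal{K}} \langle g, v \rangle \Big],
$$

which for a Euclidean ball of radius $r$ scales as $r\sqrt{d}$.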


Supplementary material for Locality defeats the curse of dimensionality in teacher-student scenarios

Neural Information Processing Systems

Supplementary material for 'Locality defeats the curse of dimensionality in convolutional teacher-student scenarios'. In this appendix we provide additional details about the derivation of Eq. (8) within the framework of [17, 18]. [...]'s are free to take any value. In short, the replica method works as follows [39]: first one defines an energy function E(f) as the argument of the minimum in Eq. (S1), then attributes to the predictor f a Boltzmann-like probability distribution P(f) = Z^{-1} e^{-β E(f)}. Hence, one can replace f in the right-hand side of Eq. (S5) with an average over P(f) at finite β, then perform the limit β → ∞ after the calculation so as to recover the generalisation error. This intuitive picture can actually be exploited in order to extract the learning curve exponent β from the asymptotic behaviour of Eq. (S6) and Eq. [...]. Notice that the risk considered in [17, 18] differs slightly from Eq. (S1) by a factor 1/P in front of the sum.
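Schematically, the construction described above is the standard zero-temperature limit of a Gibbs measure over predictors (my restatement for readability, with Z_β the partition function):

$$
P_\beta(f) \;=\; \frac{1}{Z_\beta}\, e^{-\beta E(f)}, \qquad
Z_\beta \;=\; \int \mathcal{D}f \; e^{-\beta E(f)},
$$

so that as β → ∞ the measure concentrates on the minimizer of E(f), and averages over P_β(f) recover the quantity defined through the minimum in Eq. (S1).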


Nonlinear Sufficient Dimension Reduction with a Stochastic Neural Network

Neural Information Processing Systems

Sufficient dimension reduction is a powerful tool to extract core information hidden in the high-dimensional data and has potentially many important applications in machine learning tasks. However, the existing nonlinear sufficient dimension reduction methods often lack the scalability necessary for dealing with large-scale data. We propose a new type of stochastic neural network under a rigorous probabilistic framework and show that it can be used for sufficient dimension reduction for large-scale data. The proposed stochastic neural network is trained using an adaptive stochastic gradient Markov chain Monte Carlo algorithm, whose convergence is rigorously studied in the paper as well. Through extensive experiments on real-world classification and regression problems, we show that the proposed method compares favorably with the existing state-of-the-art sufficient dimension reduction methods and is computationally more efficient for large-scale data.
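As an intuition for the training scheme (the paper's algorithm is an adaptive stochastic gradient MCMC method; the sketch below is plain stochastic gradient Langevin dynamics, shown only to convey the flavor of such samplers):

```python
import numpy as np

def sgld_step(theta, grad_log_post, step_size, rng):
    """One stochastic gradient Langevin dynamics update.

    grad_log_post: stochastic estimate of the gradient of the log posterior,
    evaluated on a mini-batch. Plain SGLD, not the paper's adaptive variant.
    """
    noise = rng.normal(size=theta.shape)
    return theta + 0.5 * step_size * grad_log_post(theta) + np.sqrt(step_size) * noise
```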


A Probabilistic Graph Coupling View of Dimension Reduction

Neural Information Processing Systems

Most popular dimension reduction (DR) methods, such as t-SNE and UMAP, are based on minimizing a cost between input and latent pairwise similarities. Though widely used, these approaches lack clear probabilistic foundations that would enable a full understanding of their properties and limitations. To this end, we introduce a unifying statistical framework based on the coupling of hidden graphs using cross entropy. These graphs induce a Markov random field dependency structure among the observations in both input and latent spaces. We show that existing pairwise similarity DR methods can be retrieved from our framework with particular choices of priors for the graphs. Moreover, this reveals that methods relying on shift-invariant kernels suffer from a statistical degeneracy that explains their poor performance in conserving coarse-grain dependencies. New links are drawn with PCA, which appears as a non-degenerate graph coupling model.
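To make the coupling view concrete, a minimal sketch (my own illustration of the kind of objective the framework unifies, not code from the paper): build row-normalized shift-invariant (Gaussian) affinities in the input and latent spaces and compare them with a cross-entropy loss, as SNE-style methods do.

```python
import numpy as np

def pairwise_affinities(X, sigma=1.0):
    """Row-normalized Gaussian (shift-invariant kernel) affinities."""
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-D / (2.0 * sigma ** 2))
    np.fill_diagonal(K, 0.0)
    return K / K.sum(axis=1, keepdims=True)

def coupling_cross_entropy(P, Q, eps=1e-12):
    """Cross entropy between input-space affinities P and latent-space affinities Q."""
    return -np.sum(P * np.log(Q + eps))

# Usage sketch: P = pairwise_affinities(X_high_dim); Q = pairwise_affinities(Z_embedding);
# the embedding Z is then optimized to reduce coupling_cross_entropy(P, Q).
```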


Unsupervised Machine Learning

#artificialintelligence

This course introduces you to one of the main types of Machine Learning: Unsupervised Learning. You will learn how to find insights from data sets that do not have a target or labeled variable. You will learn several clustering and dimension reduction algorithms for unsupervised learning, as well as how to select the algorithm that best suits your data. The hands-on section of this course focuses on using best practices for unsupervised learning. By the end of this course you should be able to:
- Explain the kinds of problems suitable for unsupervised learning approaches
- Explain the curse of dimensionality, and how it makes clustering difficult with many features
- Describe and use common clustering and dimensionality-reduction algorithms
- Try clustering points where appropriate, and compare the performance of per-cluster models
- Understand metrics relevant for characterizing clusters
Who should take this course?