Sutherland, Dougal J., Xiong, Liang, Póczos, Barnabás, Schneider, Jeff

Most machine learning algorithms, such as classification or regression, treat the individual data point as the object of interest. Here we consider extending machine learning algorithms to operate on groups of data points. We suggest treating a group of data points as an i.i.d. sample set from an underlying feature distribution for that group. Our approach employs kernel machines with a kernel on i.i.d. sample sets of vectors. We define certain kernel functions on pairs of distributions, and then use a nonparametric estimator to consistently estimate those functions based on sample sets. The projection of the estimated Gram matrix to the cone of symmetric positive semi-definite matrices enables us to use kernel machines for classification, regression, anomaly detection, and low-dimensional embedding in the space of distributions. We present several numerical experiments both on real and simulated datasets to demonstrate the advantages of our new approach.

Singh, Shashank, Póczos, Barnabás

Estimating divergences in a consistent way is of great importance in many machine learning tasks. Although this is a fundamental problem in nonparametric statistics, to the best of our knowledge there has been no finite sample exponential inequality convergence bound derived for any divergence estimators. The main contribution of our work is to provide such a bound for an estimator of R\'enyi-$\alpha$ divergence for a smooth H\"older class of densities on the $d$-dimensional unit cube $[0, 1]^d$. We also illustrate our theoretical results with a numerical experiment.

Singh, Shashank, Poczos, Barnabas

We provide finite-sample analysis of a general framework for using k-nearest neighbor statistics to estimate functionals of a nonparametric continuous probability density, including entropies and divergences. Rather than plugging a consistent density estimate (which requires k → ∞ as the sample size n → ∞) into the functional of interest, the estimators we consider fix k and perform a bias correction. This can be more efficient computationally, and, as we show, statistically, leading to faster convergence rates. Our framework unifies several previous estimators, for most of which ours are the first finite sample guarantees.

The problem of f-divergence estimation is important in the fields of machine learning, information theory, and statistics. While several divergence estimators exist, relatively few have known convergence properties. In particular, even for those estimators whose MSE convergence rates are known, the asymptotic distributions are unknown. We establish the asymptotic normality of a recently proposed ensemble estimator of f-divergence between two distributions from a finite number of samples. This estimator has MSE convergence rate of O(1/T), is simple to implement, and performs well in high dimensions. This theory enables us to perform divergence-based inference tasks such as testing equality of pairs of distributions based on empirical samples. We experimentally validate our theoretical results and, as an illustration, use them to empirically bound the best achievable classification error.

Singh, Shashank, Póczos, Barnabás

We provide finite-sample analysis of a general framework for using k-nearest neighbor statistics to estimate functionals of a nonparametric continuous probability density, including entropies and divergences. Rather than plugging a consistent density estimate (which requires $k \to \infty$ as the sample size $n \to \infty$) into the functional of interest, the estimators we consider fix k and perform a bias correction. This is more efficient computationally, and, as we show in certain cases, statistically, leading to faster convergence rates. Our framework unifies several previous estimators, for most of which ours are the first finite sample guarantees.