Universal Hypothesis Testing with Kernels: Asymptotically Optimal Tests for Goodness of Fit

arXiv.org Machine Learning

We characterize the asymptotic performance of nonparametric goodness of fit testing, otherwise known as the universal hypothesis testing that dates back to Hoeffding (1965). The exponential decay rate of the type-II error probability is used as the asymptotic performance metric, hence an optimal test achieves the maximum decay rate subject to a constant level constraint on the type-I error probability. We show that two classes of Maximum Mean Discrepancy (MMD) based tests attain this optimality on $\mathbb R^d$, while a Kernel Stein Discrepancy (KSD) based test achieves a weaker one under this criterion. In the finite sample regime, these tests have similar statistical performance in our experiments, while the KSD based test is more computationally efficient. Key to our approach are Sanov's theorem from large deviation theory and recent results on the weak convergence properties of the MMD and KSD.

Learning Gaussian Tree Models: Analysis of Error Exponents and Extremal Structures

arXiv.org Machine Learning

The problem of learning tree-structured Gaussian graphical models from independent and identically distributed (i.i.d.) samples is considered. The influence of the tree structure and the parameters of the Gaussian distribution on the learning rate as the number of samples increases is discussed. Specifically, the error exponent corresponding to the event that the estimated tree structure differs from the actual unknown tree structure of the distribution is analyzed. Finding the error exponent reduces to a least-squares problem in the very noisy learning regime. In this regime, it is shown that the extremal tree structure that minimizes the error exponent is the star for any fixed set of correlation coefficients on the edges of the tree. If the magnitudes of all the correlation coefficients are less than 0.63, it is also shown that the tree structure that maximizes the error exponent is the Markov chain. In other words, the star and the chain graphs represent the hardest and the easiest structures to learn in the class of tree-structured Gaussian graphical models. This result can also be intuitively explained by correlation decay: pairs of nodes which are far apart, in terms of graph distance, are unlikely to be mistaken as edges by the maximum-likelihood estimator in the asymptotic regime.

Lossy Compression via Sparse Linear Regression: Performance under Minimum-distance Encoding

arXiv.org Machine Learning

We study a new class of codes for lossy compression with the squared-error distortion criterion, designed using the statistical framework of high-dimensional linear regression. Codewords are linear combinations of subsets of columns of a design matrix. Called a Sparse Superposition or Sparse Regression codebook, this structure is motivated by an analogous construction proposed recently by Barron and Joseph for communication over an AWGN channel. For i.i.d Gaussian sources and minimum-distance encoding, we show that such a code can attain the Shannon rate-distortion function with the optimal error exponent, for all distortions below a specified value. It is also shown that sparse regression codes are robust in the following sense: a codebook designed to compress an i.i.d Gaussian source of variance $\sigma^2$ with (squared-error) distortion $D$ can compress any ergodic source of variance less than $\sigma^2$ to within distortion $D$. Thus the sparse regression ensemble retains many of the good covering properties of the i.i.d random Gaussian ensemble, while having having a compact representation in terms of a matrix whose size is a low-order polynomial in the block-length.

A Polynomial-Time Deterministic Approach to the Traveling Salesperson Problem

arXiv.org Artificial Intelligence

We propose a new polynomial-time deterministic algorithm that produces an approximated solution for the traveling salesperson problem. The proposed algorithm ranks cities based on their priorities calculated using a power function of means and standard deviations of their distances from other cities and then connects the cities to their neighbors in the order of their priorities. When connecting a city, a neighbor is selected based on their neighbors' priorities calculated as another power function that additionally includes their distance from the focal city to be connected. This repeats until all the cities are connected into a single loop. The time complexity of the proposed algorithm is $O(n^2)$, where $n$ is the number of cities. Numerical evaluation shows that, despite its simplicity, the proposed algorithm produces shorter tours with less time complexity than other conventional tour construction heuristics. The proposed algorithm can be used by itself or as an initial tour generator for other more complex heuristic optimization algorithms.