isotropic position
Forster Decomposition and Learning Halfspaces with Noise
AForster transform is an operation that turns a distribution into one with good anticoncentration properties. While a Forster transform does not always exist, we show that any distribution can be efficiently decomposed as a disjoint mixture of few distributions for which a Forster transform exists and can be computed efficiently. As the main application of this result, we obtain the first polynomial-time algorithm for distribution-independent PAC learning of halfspaces in the Massart noise model with strongly polynomial sample complexity, i.e., independent of the bit complexity of the examples. Previous algorithms for this learning problem incurred sample complexity scaling polynomially with the bit complexity, even though such a dependence is not information-theoretically necessary.
Dimension Reduction via Sum-of-Squares and Improved Clustering Algorithms for Non-Spherical Mixtures
Anderson, Prashanti, Bafna, Mitali, Buhai, Rares-Darius, Kothari, Pravesh K., Steurer, David
We develop a new approach for clustering non-spherical (i.e., arbitrary component covariances) Gaussian mixture models via a subroutine, based on the sum-of-squares method, that finds a low-dimensional separation-preserving projection of the input data. Our method gives a non-spherical analog of the classical dimension reduction, based on singular value decomposition, that forms a key component of the celebrated spherical clustering algorithm of Vempala and Wang [VW04] (in addition to several other applications). As applications, we obtain an algorithm to (1) cluster an arbitrary total-variation separated mixture of $k$ centered (i.e., zero-mean) Gaussians with $n\geq \operatorname{poly}(d) f(w_{\min}^{-1})$ samples and $\operatorname{poly}(n)$ time, and (2) cluster an arbitrary total-variation separated mixture of $k$ Gaussians with identical but arbitrary unknown covariance with $n \geq d^{O(\log w_{\min}^{-1})} f(w_{\min}^{-1})$ samples and $n^{O(\log w_{\min}^{-1})}$ time. Here, $w_{\min}$ is the minimum mixing weight of the input mixture, and $f$ does not depend on the dimension $d$. Our algorithms naturally extend to tolerating a dimension-independent fraction of arbitrary outliers. Before this work, the techniques in the state-of-the-art non-spherical clustering algorithms needed $d^{O(k)} f(w_{\min}^{-1})$ time and samples for clustering such mixtures. Our results may come as a surprise in the context of the $d^{\Omega(k)}$ statistical query lower bound [DKS17] for clustering non-spherical Gaussian mixtures. While this result is usually thought to rule out $d^{o(k)}$ cost algorithms for the problem, our results show that the lower bounds can in fact be circumvented for a remarkably general class of Gaussian mixtures.
Beyond Parallel Pancakes: Quasi-Polynomial Time Guarantees for Non-Spherical Gaussian Mixtures
Buhai, Rares-Darius, Steurer, David
We consider mixtures of $k\geq 2$ Gaussian components with unknown means and unknown covariance (identical for all components) that are well-separated, i.e., distinct components have statistical overlap at most $k^{-C}$ for a large enough constant $C\ge 1$. Previous statistical-query lower bounds [DKS17] give formal evidence that even distinguishing such mixtures from (pure) Gaussians may be exponentially hard (in $k$). We show that this kind of hardness can only appear if mixing weights are allowed to be exponentially small, and that for polynomially lower bounded mixing weights non-trivial algorithmic guarantees are possible in quasi-polynomial time. Concretely, we develop an algorithm based on the sum-of-squares method with running time quasi-polynomial in the minimum mixing weight. The algorithm can reliably distinguish between a mixture of $k\ge 2$ well-separated Gaussian components and a (pure) Gaussian distribution. As a certificate, the algorithm computes a bipartition of the input sample that separates a pair of mixture components, i.e., both sides of the bipartition contain most of the sample points of at least one component. For the special case of colinear means, our algorithm outputs a $k$ clustering of the input sample that is approximately consistent with the components of the mixture. A significant challenge for our results is that they appear to be inherently sensitive to small fractions of adversarial outliers unlike most previous results for Gaussian mixtures. The reason is that such outliers can simulate exponentially small mixing weights even for mixtures with polynomially lower bounded mixing weights. A key technical ingredient is a characterization of separating directions for well-separated Gaussian components in terms of ratios of polynomials that correspond to moments of two carefully chosen orders logarithmic in the minimum mixing weight.
Forster Decomposition and Learning Halfspaces with Noise
Diakonikolas, Ilias, Kane, Daniel M., Tzamos, Christos
The motivating application for this paper is the problem of(distribution-independent) PAC learning of halfspaces in the presence of label noise, and more specifically in the Massart (or bounded noise) model. Recent work [DGT19] obtained the first computationally efficient learning algorithm with non-trivial error guarantee for this problem. Interestingly, the sample complexity of the [DGT19] algorithm scales polynomially with the bit complexity of the examples (in addition, of course, to the dimension and the inverse of desired accuracy). This bit-complexity dependence in the sample complexity is an artifact of the algorithmic approach in [DGT19]. Information-theoretically, no such dependence is needed -- alas, the standard VC-dimension-based sample upper bound [MN06] is non-constructive. Motivated by this qualitative gap in our understanding, here we develop a methodology that leads to a computationally efficient learning algorithm for Massart halfspaces (matching the error guarantee of [DGT19]) with "strongly polynomial" sample complexity, i.e., sample complexity completely independent of the bit complexity of the examples. Halfspaces and Efficient Learnability We study the binary classification setting, where the goal is to learn a Boolean function from random labeled examples with noisy labels. Our focus is on the problem of learning halfspaces in Valiant's PAC learning model [Val84] when the labels have been corrupted by Massart noise [MN06].
Faster algorithms for polytope rounding, sampling, and volume computation via a sublinear "Ball Walk''
Mangoubi, Oren, Vishnoi, Nisheeth K.
We study the problem of "isotropically rounding" a polytope $K\subseteq\mathbb{R}^n$, that is, computing a linear transformation which makes the uniform distribution on the polytope have roughly identity covariance matrix. We assume that $K$ is defined by $m$ linear inequalities, with guarantee that $rB\subseteq K\subseteq RB$, where $B$ is the unit ball. We introduce a new variant of the ball walk Markov chain and show that, roughly, the expected number of arithmetic operations per-step of this Markov chain is $O(m)$ that is sublinear in the input size $mn$--the per-step time of all prior Markov chains. Subsequently, we give a rounding algorithm that succeeds with probability $1-\varepsilon$ in $\tilde{O}(mn^{4.5}\mathrm{polylog}(\frac{1}{\varepsilon},\frac{R}{r}))$ arithmetic operations. This gives a factor of $\sqrt{n}$ improvement on the previous bound of $\tilde{O}(mn^{5} \mathrm{polylog}(\frac{1}{\varepsilon},\frac{R}{r}))$ for rounding, which uses the hit-and-run algorithm. Since the cost of the rounding preprocessing step is in many cases the bottleneck in improving sampling or volume computation, our results imply these tasks can also be achieved in roughly $\tilde{O}(mn^{4.5}\mathrm{polylog}(\frac{1}{\varepsilon},\frac{R}{r})+mn^4\delta^{-2})$ operations for computing the volume of $K$ up to a factor $1+\delta$ and $\tilde{O}(m n^{4.5}\mathrm{polylog}(\frac{1}{\varepsilon},\frac{R}{r})))$ for uniformly sampling on $K$ with TV error $\varepsilon$. This improves on the previous bounds of $\tilde{O}(mn^{5}\mathrm{polylog}(\frac{1}{\varepsilon},\frac{R}{r})+mn^4\delta^{-2})$ for volume computation and $\tilde{O}(mn^{5}\mathrm{polylog}(\frac{1}{\varepsilon},\frac{R}{r}))$ for sampling. We achieve this improvement by a novel method of computing polytope membership, where one avoids checking inequalities which are estimated to have a very low probability of being violated.
Differentially Private Empirical Risk Minimization: Efficient Algorithms and Tight Error Bounds
Bassily, Raef, Smith, Adam, Thakurta, Abhradeep
In this paper, we initiate a systematic investigation of differentially private algorithms for convex empirical risk minimization. Various instantiations of this problem have been studied before. We provide new algorithms and matching lower bounds for private ERM assuming only that each data point's contribution to the loss function is Lipschitz bounded and that the domain of optimization is bounded. We provide a separate set of algorithms and matching lower bounds for the setting in which the loss functions are known to also be strongly convex. Our algorithms run in polynomial time, and in some cases even match the optimal non-private running time (as measured by oracle complexity). We give separate algorithms (and lower bounds) for $(\epsilon,0)$- and $(\epsilon,\delta)$-differential privacy; perhaps surprisingly, the techniques used for designing optimal algorithms in the two cases are completely different. Our lower bounds apply even to very simple, smooth function families, such as linear and quadratic functions. This implies that algorithms from previous work can be used to obtain optimal error rates, under the additional assumption that the contributions of each data point to the loss function is smooth. We show that simple approaches to smoothing arbitrary loss functions (in order to apply previous techniques) do not yield optimal error rates. In particular, optimal algorithms were not previously known for problems such as training support vector machines and the high-dimensional median.
On Zeroth-Order Stochastic Convex Optimization via Random Walks
Liang, Tengyuan, Narayanan, Hariharan, Rakhlin, Alexander
We propose a method for zeroth order stochastic convex optimization that attains the suboptimality rate of $\tilde{\mathcal{O}}(n^{7}T^{-1/2})$ after $T$ queries for a convex bounded function $f:{\mathbb R}^n\to{\mathbb R}$. The method is based on a random walk (the \emph{Ball Walk}) on the epigraph of the function. The randomized approach circumvents the problem of gradient estimation, and appears to be less sensitive to noisy function evaluations compared to noiseless zeroth order methods.