# symposium

### The Broad Optimality of Profile Maximum Likelihood

We study three fundamental statistical-learning problems: distribution estimation, property estimation, and property testing. We establish the profile maximum likelihood (PML) estimator as the first unified sample-optimal approach to a wide range of learning tasks. In particular, for every alphabet size $k$ and desired accuracy $\varepsilon$:

- **Distribution estimation.** Under $\ell_1$ distance, PML yields optimal $\Theta(k/(\varepsilon^2\log k))$ sample complexity for sorted-distribution estimation, and a PML-based estimator empirically outperforms the Good-Turing estimator on the actual distribution.
- **Additive property estimation.** For a broad class of additive properties, the PML plug-in estimator uses just four times the sample size required by the best estimator to achieve roughly twice its error, with exponentially higher confidence.
- **$\alpha$-Rényi entropy estimation.** For integer $\alpha>1$, the PML plug-in estimator has optimal $k^{1-1/\alpha}$ sample complexity; for non-integer $\alpha>3/4$, the PML plug-in estimator has sample complexity lower than the state of the art.
- **Identity testing.** In testing whether an unknown distribution is equal to or at least $\varepsilon$ far from a given distribution in $\ell_1$ distance, a PML-based tester achieves the optimal sample complexity up to logarithmic factors of $k$.

With minor modifications, most of these results also hold for a near-linear-time computable variant of PML.
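
The abstract does not define the profile that PML maximizes over. Assuming the standard definition (the profile of a sample records, for each multiplicity, how many distinct symbols appear that many times), a minimal Python sketch of computing it is shown below. The function name `profile` is hypothetical, and the sketch does not attempt to compute the PML distribution itself.

```python
from collections import Counter


def profile(sample):
    """Map each multiplicity to the number of distinct symbols seen that often."""
    multiplicities = Counter(sample)                      # symbol -> count
    prevalences = Counter(multiplicities.values())        # count -> number of symbols
    return dict(sorted(prevalences.items()))


# The profile ignores symbol labels: relabeling the alphabet leaves it unchanged.
sample = list("abracadabra")
relabeled = [{"a": "v", "b": "w", "r": "x", "c": "y", "d": "z"}[s] for s in sample]
same_profile = profile(sample) == profile(relabeled)
```

Because the profile discards labels, it is a natural sufficient summary for label-invariant (symmetric) properties such as entropy or support size, which is the setting the tasks above share.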

### Faster Algorithms for High-Dimensional Robust Covariance Estimation

We study the problem of estimating the covariance matrix of a high-dimensional distribution when a small constant fraction of the samples can be arbitrarily corrupted. Recent work gave the first polynomial time algorithms for this problem with near-optimal error guarantees for several natural structured distributions. Our main contribution is to develop faster algorithms for this problem whose running time nearly matches that of computing the empirical covariance. Given $N = \tilde{\Omega}(d^2/\epsilon^2)$ samples from a $d$-dimensional Gaussian distribution, an $\epsilon$-fraction of which may be arbitrarily corrupted, our algorithm runs in time $\tilde{O}(d^{3.26})/\mathrm{poly}(\epsilon)$ and approximates the unknown covariance matrix to optimal error up to a logarithmic factor. Previous robust algorithms with comparable error guarantees all have runtimes $\tilde{\Omega}(d^{2 \omega})$ when $\epsilon = \Omega(1)$, where $\omega$ is the exponent of matrix multiplication. We also provide evidence that improving the running time of our algorithm may require new algorithmic techniques.

### Private Learning Implies Online Learning: An Efficient Reduction

Differentially Private Learning and Online Learning are two well-studied areas in machine learning. While at first glance these two subjects may seem disparate, recent works have gathered a growing body of evidence suggesting otherwise. For example, Adaptive Data Analysis [15, 14, 24, 19, 3] shares strong similarities with adversarial frameworks studied in online learning, while also exploiting ideas and tools from differential privacy. A more formal relation between private and online learning is manifested by the following fact: every privately learnable class is online learnable. This implication and variants of it were derived by several recent works [20, 9, 1] (see the related work section for more details). One caveat of these results is that they are non-constructive: they show that every privately learnable class has a finite Littlestone dimension. Then, since the Littlestone dimension is known to capture online learnability [26, 5], it follows that privately learnable classes are indeed online learnable. Consequently, the implied online learner is not necessarily efficient, even if the assumed private learner is.

### A Direct $\tilde{O}(1/\epsilon)$ Iteration Parallel Algorithm for Optimal Transport

Optimal transportation, or computing the Wasserstein or "earth mover's" distance between two distributions, is a fundamental primitive which arises in many learning and statistical settings. We give an algorithm which solves this problem to additive $\epsilon$ with $\tilde{O}(1/\epsilon)$ parallel depth, and $\tilde{O}\left(n^2/\epsilon\right)$ work. Barring a breakthrough on a long-standing algorithmic open problem, this is optimal for first-order methods. Blanchet et al. '18 and Quanrud '19 obtained similar runtimes through reductions to positive linear programming and matrix scaling. However, these reduction-based algorithms use complicated subroutines which may be deemed impractical due to requiring solvers for second-order iterations (matrix scaling) or non-parallelizability (positive LP). The fastest practical algorithms run in time $\tilde{O}(\min(n^2 / \epsilon^2, n^{2.5} / \epsilon))$ (Dvurechensky et al. '18, Lin et al. '19). We bridge this gap by providing a parallel, first-order, $\tilde{O}(1/\epsilon)$ iteration algorithm without worse dependence on dimension, and provide preliminary experimental evidence that our algorithm may enjoy improved practical performance. We obtain this runtime via a primal-dual extragradient method, motivated by recent theoretical improvements to maximum flow (Sherman '17).
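
For concreteness, the discrete problem is commonly written as a linear program. The notation below ($r$, $c$, $C$, $X$) is an assumption for illustration and is not taken from the abstract:

\begin{align*} \mathrm{OT}(r, c) = \min_{X \in \mathbb{R}_{\ge 0}^{n \times n}} \langle C, X \rangle \quad \text{subject to} \quad X \mathbf{1} = r, \quad X^\top \mathbf{1} = c, \end{align*}

where $r$ and $c$ are probability vectors of dimension $n$ and $C$ is a nonnegative cost matrix. Under this formulation, solving to additive $\epsilon$ would mean returning a feasible $\hat{X}$ with $\langle C, \hat{X} \rangle \le \mathrm{OT}(r, c) + \epsilon$, presumably with the costs normalized to a bounded range. The variable $X$ has $n^2$ entries, which is consistent with the $n^2$ factor in the work bounds quoted above.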

### Private Hypothesis Selection

We provide a differentially private algorithm for hypothesis selection. Given samples from an unknown probability distribution $P$ and a set of $m$ probability distributions $\mathcal{H}$, the goal is to output, in an $\varepsilon$-differentially private manner, a distribution from $\mathcal{H}$ whose total variation distance to $P$ is comparable to that of the best such distribution (which we denote by $\alpha$). The sample complexity of our basic algorithm is $O\left(\frac{\log m}{\alpha^2} + \frac{\log m}{\alpha \varepsilon}\right)$, representing a minimal cost for privacy when compared to the non-private algorithm. We can also handle infinite hypothesis classes $\mathcal{H}$ by relaxing to $(\varepsilon,\delta)$-differential privacy. We apply our hypothesis selection algorithm to give learning algorithms for a number of natural distribution classes, including Gaussians, product distributions, sums of independent random variables, piecewise polynomials, and mixture classes. Our hypothesis selection procedure allows us to generically convert a cover for a class to a learning algorithm, complementing known learning lower bounds which are in terms of the size of the packing number of the class. As the covering and packing numbers are often closely related, for constant $\alpha$, our algorithms achieve the optimal sample complexity for many classes of interest. Finally, we describe an application to private distribution-free PAC learning.
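
The abstract does not spell out how candidate hypotheses are compared. A classical non-private building block for hypothesis selection is the Scheffé test between two candidates, sketched below for distributions on a finite domain. This is only an illustration of the non-private baseline: it adds no noise, offers no privacy guarantee, and is not claimed to be the paper's algorithm. The function name and the dictionary representation are assumptions.

```python
def scheffe_winner(h1, h2, samples):
    """Pick the hypothesis (0 for h1, 1 for h2) that better matches the samples.

    h1, h2: dicts mapping outcomes to probabilities on a finite domain.
    The Scheffe set is where h1 puts strictly more mass than h2; the winner is
    the hypothesis whose mass on that set is closer to the empirical mass.
    """
    domain = set(h1) | set(h2)
    scheffe_set = {x for x in domain if h1.get(x, 0.0) > h2.get(x, 0.0)}
    mass1 = sum(h1.get(x, 0.0) for x in scheffe_set)
    mass2 = sum(h2.get(x, 0.0) for x in scheffe_set)
    empirical = sum(1 for s in samples if s in scheffe_set) / len(samples)
    return 0 if abs(mass1 - empirical) <= abs(mass2 - empirical) else 1


# Hypothetical candidates and data, for illustration only.
skewed = {"a": 0.9, "b": 0.1}
uniform = {"a": 0.5, "b": 0.5}
choice = scheffe_winner(skewed, uniform, ["a"] * 9 + ["b"])
```

Running such pairwise comparisons across all $m$ candidates is one standard way to reach the non-private $O(\log m / \alpha^2)$ sample complexity that the first term of the bound above mirrors.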

### Sequential no-Substitution k-Median-Clustering

We study the sample-based $k$-median clustering objective under a sequential setting without substitutions. In this setting, the goal is to select $k$ centers that approximate the optimal clustering on an unknown distribution from a finite sequence of i.i.d. samples, where any selection of a center must be done immediately after the center is observed and centers cannot be substituted after selection. We provide an efficient algorithm for this setting, and show that its multiplicative approximation factor is twice the approximation factor of an efficient offline algorithm. In addition, we show that if efficiency requirements are removed, there is an algorithm that can obtain the same approximation factor as the best offline algorithm.
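
To make the setting concrete, the sketch below shows an empirical $k$-median cost and the shape of a no-substitution selection loop. It assumes Euclidean distance and uses a hypothetical `accept` callback as the decision rule. The rule shown in the usage lines (take the first $k$ points) is a naive baseline chosen only to exercise the loop, not the algorithm from the paper.

```python
import math


def k_median_cost(points, centers):
    """Average Euclidean distance from each point to its nearest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points) / len(points)


def select_without_substitution(stream, k, accept):
    """Scan the stream once; each point is either chosen immediately or lost.

    accept(point, centers, t) is a hypothetical decision rule. Chosen centers
    are never removed or replaced, matching the no-substitution constraint.
    """
    centers = []
    for t, point in enumerate(stream):
        if len(centers) < k and accept(point, centers, t):
            centers.append(point)  # irrevocable selection
    return centers


stream = [(0, 0), (0, 2), (10, 0), (10, 2)]
naive = select_without_substitution(stream, 2, lambda p, chosen, t: True)
naive_cost = k_median_cost(stream, naive)
offline_cost = k_median_cost(stream, [(0, 0), (10, 0)])
```

On this toy stream the naive rule commits to two nearby points and pays a much higher cost than a choice made with hindsight, which is the gap a good sequential rule has to control.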

### Private Identity Testing for High-Dimensional Distributions

In this work we present novel differentially private identity (goodness-of-fit) testers for natural and widely studied classes of multivariate product distributions: Gaussians in $\mathbb{R}^d$ with known covariance and product distributions over $\{\pm 1\}^{d}$. Our testers have improved sample complexity compared to those derived from previous techniques, and are the first testers whose sample complexity matches the order-optimal minimax sample complexity of $O(d^{1/2}/\alpha^2)$ in many parameter regimes. We construct two types of testers, exhibiting tradeoffs between sample complexity and computational complexity. Finally, we provide a two-way reduction between testing a subclass of multivariate product distributions and testing univariate distributions, and thereby obtain upper and lower bounds for testing this subclass of product distributions.

### Private Learning and Regularized Optimal Transport

Private data are valuable either by remaining private (for instance if they are sensitive) or, on the other hand, by being used publicly to increase some utility. These two objectives are antagonistic and leaking data might be more rewarding than concealing them. Unlike classical concepts of privacy that focus on the first point, we consider instead agents that optimize a natural trade-off between both objectives. We formalize this as an optimization problem where the objective mapping is regularized by the amount of information leaked by the agent into the system (measured as a divergence between the prior and posterior on the private data). Quite surprisingly, when combined with the entropic regularization, the Sinkhorn divergence naturally emerges in the optimization objective, making it efficiently solvable. We apply these techniques to preserve some privacy in online repeated auctions.
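
The abstract gives no algorithm. As background on why Sinkhorn-type objectives are considered efficiently solvable, the sketch below shows the standard Sinkhorn matrix-scaling iteration for entropy-regularized optimal transport between two discrete marginals. It does not model the privacy trade-off described above; the function name, the regularization parameter `reg`, and the fixed iteration count are assumptions.

```python
import math


def sinkhorn(cost, r, c, reg=0.5, iters=500):
    """Approximate an entropy-regularized transport plan by alternating scaling.

    cost: n x m list of lists; r, c: marginals summing to 1; reg: regularization
    strength. Returns a plan whose row and column sums approach r and c.
    """
    n, m = len(r), len(c)
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Rescale rows to match r, then columns to match c.
        u = [r[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [c[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(cost, [0.3, 0.7], [0.6, 0.4])
transport_cost = sum(cost[i][j] * plan[i][j] for i in range(2) for j in range(2))
```

Each iteration costs only a pair of matrix-vector products, which is the usual argument for the practicality of entropic regularization.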

### Mayo Clinic holds artificial intelligence symposium

ROCHESTER, Minn. - Mayo Clinic is one of the world leaders in healthcare. Now, the institution is looking for the next best way to provide efficient care to patients. Mayo held an artificial intelligence symposium, looking to bring together groups of developers from around the world. They're working on projects using technology to help patients. One of the ingenious ideas is a computing system that can interpret medical information and offer preliminary recommendations for patients.

### Solving Empirical Risk Minimization in the Current Matrix Multiplication Time

Many convex problems in machine learning and computer science share the same form:
\begin{align*} \min_{x} \sum_{i} f_i( A_i x + b_i), \end{align*}
where $f_i$ are convex functions on $\mathbb{R}^{n_i}$ with constant $n_i$, $A_i \in \mathbb{R}^{n_i \times d}$, $b_i \in \mathbb{R}^{n_i}$ and $\sum_i n_i = n$. This problem generalizes linear programming and includes many problems in empirical risk minimization. In this paper, we give an algorithm that runs in time
\begin{align*} O^* ( ( n^{\omega} + n^{2.5 - \alpha/2} + n^{2+ 1/6} ) \log (n / \delta) ) \end{align*}
where $\omega$ is the exponent of matrix multiplication, $\alpha$ is the dual exponent of matrix multiplication, and $\delta$ is the relative accuracy. Note that the runtime has only a logarithmic dependence on the condition numbers and other data-dependent parameters, which are captured in $\delta$. For the current bound $\omega \sim 2.38$ [Vassilevska Williams'12, Le Gall'14] and $\alpha \sim 0.31$ [Le Gall, Urrutia'18], our runtime $O^* ( n^{\omega} \log (n / \delta))$ matches the current best for solving a dense least squares regression problem, a special case of the problem we consider. Very recently, [Alman'18] proved that all currently known techniques cannot yield $\omega$ below $2.168$, which is larger than our $2+1/6$. Our result generalizes the very recent result of solving linear programs in the current matrix multiplication time [Cohen, Lee, Song'19] to a broader class of problems. Our algorithm introduces two concepts that differ from [Cohen, Lee, Song'19]:

- We give a robust deterministic central path method, whereas the previous one is a stochastic central path which updates weights by a random sparse vector.
- We propose an efficient data structure to maintain the central path of interior point methods even when the weights update vector is dense.
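
As a worked instance of the general form, least squares regression can be recovered as follows. The symbols $a_i$ and $y_i$ are introduced here for illustration and do not appear in the abstract. Take $n_i = 1$ for every $i$, let $A_i = a_i^\top$ be the $i$-th row of a design matrix $A \in \mathbb{R}^{n \times d}$, set $b_i = -y_i$, and choose $f_i(t) = t^2$. Then

\begin{align*} \min_{x} \sum_{i=1}^{n} f_i( A_i x + b_i) = \min_{x} \sum_{i=1}^{n} ( a_i^\top x - y_i )^2 = \min_{x} \| A x - y \|_2^2 . \end{align*}

Each $f_i$ is convex on $\mathbb{R}^{1}$ and $\sum_i n_i = n$, so the instance satisfies the stated assumptions.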