Collaborating Authors

 Canonne, Clément L.


Learning bounded-degree polytrees with known skeleton

arXiv.org Machine Learning

We establish finite-sample guarantees for efficient proper learning of bounded-degree polytrees, a rich class of high-dimensional probability distributions and a subclass of Bayesian networks, a widely-studied type of graphical model. Recently, Bhattacharyya et al. (2021) obtained finite-sample guarantees for recovering tree-structured Bayesian networks, i.e., 1-polytrees. We extend their results by providing an efficient algorithm which learns $d$-polytrees in polynomial time and sample complexity for any bounded $d$ when the underlying undirected graph (skeleton) is known. We complement our algorithm with an information-theoretic sample complexity lower bound, showing that the dependence on the dimension and target accuracy parameters is nearly tight.
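
As illustrative background only (not the paper's $d$-polytree algorithm), the tree-structured special case referenced above, 1-polytrees, is classically learned via the Chow-Liu procedure: estimate pairwise mutual information from samples and take a maximum-weight spanning tree. A minimal Python sketch, with hypothetical helper names:

    import numpy as np
    import networkx as nx

    def empirical_mutual_information(x, y):
        """Plug-in mutual information between two discrete sample columns."""
        n = len(x)
        joint = {}
        for a, b in zip(x, y):
            joint[(a, b)] = joint.get((a, b), 0) + 1
        px = {a: np.mean(x == a) for a in set(x)}
        py = {b: np.mean(y == b) for b in set(y)}
        return sum((c / n) * np.log((c / n) / (px[a] * py[b]))
                   for (a, b), c in joint.items())

    def chow_liu_tree(samples):
        """samples: (n, d) array of discrete observations; returns a max-weight spanning tree."""
        _, d = samples.shape
        g = nx.Graph()
        for i in range(d):
            for j in range(i + 1, d):
                w = empirical_mutual_information(samples[:, i], samples[:, j])
                g.add_edge(i, j, weight=w)
        return nx.maximum_spanning_tree(g)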


Private Distribution Learning with Public Data: The View from Sample Compression

arXiv.org Artificial Intelligence

We study the problem of private distribution learning with access to public data. In this setup, which we refer to as public-private learning, the learner is given public and private samples drawn from an unknown distribution $p$ belonging to a class $\mathcal Q$, with the goal of outputting an estimate of $p$ while adhering to privacy constraints (here, pure differential privacy) only with respect to the private samples. We show that the public-private learnability of a class $\mathcal Q$ is connected to the existence of a sample compression scheme for $\mathcal Q$, as well as to an intermediate notion we refer to as list learning. Leveraging this connection, we (1) approximately recover previous results on Gaussians over $\mathbb R^d$, and (2) obtain new ones, including sample complexity upper bounds for arbitrary $k$-mixtures of Gaussians over $\mathbb R^d$, results for agnostic and distribution-shift-resistant learners, and closure properties for public-private learnability under taking mixtures and products of distributions. Finally, via the connection to list learning, we show that for Gaussians in $\mathbb R^d$, at least $d$ public samples are necessary for private learnability, which is close to the known upper bound of $d+1$ public samples.
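
For reference, the privacy requirement here is standard pure differential privacy applied only to the private portion of the input: an algorithm $A$ taking public samples $S_{\mathrm{pub}}$ and private samples $S_{\mathrm{priv}}$ is $\varepsilon$-differentially private with respect to the private samples if, for every event $E$ and every pair $S_{\mathrm{priv}}, S'_{\mathrm{priv}}$ differing in a single sample,
\[
\Pr\bigl[A(S_{\mathrm{pub}}, S_{\mathrm{priv}}) \in E\bigr] \;\le\; e^{\varepsilon}\, \Pr\bigl[A(S_{\mathrm{pub}}, S'_{\mathrm{priv}}) \in E\bigr].
\]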


Concentration Bounds for Discrete Distribution Estimation in KL Divergence

arXiv.org Artificial Intelligence

Discrete distribution estimation, i.e., density estimation over discrete domains, is a fundamental problem in Statistics, with a rich history (see, e.g., [9, 10] for an overview and further references). In this work, we address a simple yet surprisingly ill-understood aspect of this question: what is the sample complexity of estimating an arbitrary discrete distribution in Kullback-Leibler (KL) divergence with vanishing probability of error? To describe the problem further, a few definitions are in order.
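
As a concrete, hedged illustration of the task, a standard baseline in this setting is the add-constant (Laplace-smoothed) estimator, which guarantees the estimate has full support so that the KL divergence from the true distribution is finite; whether this particular estimator attains the guarantees studied in the paper is not claimed here. A minimal Python sketch:

    import numpy as np

    def add_constant_estimate(samples, domain_size, c=1.0):
        """Return the smoothed empirical distribution: (count_i + c) / (n + c * k)."""
        counts = np.bincount(samples, minlength=domain_size).astype(float)
        return (counts + c) / (len(samples) + c * domain_size)

    def kl_divergence(p, q):
        """KL(p || q) over a finite domain; q must be positive wherever p is."""
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))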


Near-Optimal Degree Testing for Bayes Nets

arXiv.org Artificial Intelligence

This paper considers the problem of testing the maximum in-degree of the Bayes net underlying an unknown probability distribution $P$ over $\{0,1\}^n$, given sample access to $P$. We show that the sample complexity of the problem is $\tilde{\Theta}(2^{n/2}/\varepsilon^2)$. Our algorithm relies on a testing-by-learning framework, previously used to obtain sample-optimal testers; in order to apply this framework, we develop new algorithms for ``near-proper'' learning of Bayes nets, and high-probability learning under $\chi^2$ divergence, which are of independent interest.
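
The testing-by-learning framework mentioned above can be sketched, at a purely schematic level, as a two-stage reduction; the helpers below (`learn_hypothesis`, `tolerant_identity_test`) are hypothetical placeholders for a near-proper learner and a tolerant identity tester, not the paper's actual subroutines:

    def test_by_learning(samples_learn, samples_test, epsilon,
                         learn_hypothesis, tolerant_identity_test):
        # Stage 1: run a (near-)proper learner on one batch of samples to get a
        # candidate hypothesis from the class (here, bounded in-degree Bayes nets).
        q_hat = learn_hypothesis(samples_learn, accuracy=epsilon / 4)
        # Stage 2: test the unknown distribution against the fixed candidate.
        # If P lies in the class, q_hat should be close to P and the test accepts;
        # if P is far from the class, it is in particular far from q_hat.
        return tolerant_identity_test(samples_test, q_hat,
                                      near=epsilon / 2, far=epsilon)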


Independence Testing for Bounded Degree Bayesian Network

arXiv.org Artificial Intelligence

We study the following independence testing problem: given access to samples from a distribution $P$ over $\{0,1\}^n$, decide whether $P$ is a product distribution or whether it is $\varepsilon$-far in total variation distance from any product distribution. For arbitrary distributions, this problem requires $\exp(n)$ samples. We show in this work that if $P$ has a sparse structure, then in fact only linearly many samples are required. Specifically, if $P$ is Markov with respect to a Bayesian network whose underlying DAG has in-degree bounded by $d$, then $\tilde{\Theta}(2^{d/2}\cdot n/\varepsilon^2)$ samples are necessary and sufficient for independence testing.
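
For intuition only, the quantity being tested can be made concrete on tiny domains by brute force. The sketch below computes the total variation distance between a distribution $P$ on $\{0,1\}^n$ (given as a probability table) and the product of its own marginals; this is an illustrative proxy for distance from the set of product distributions, not the paper's tester, and the helper name is hypothetical:

    import itertools
    import numpy as np

    def tv_to_product_of_marginals(p_table, n):
        """p_table: dict mapping length-n 0/1 tuples to probabilities."""
        marginals = np.zeros(n)  # marginals[i] = P(X_i = 1)
        for x, px in p_table.items():
            for i, xi in enumerate(x):
                marginals[i] += px * xi
        tv = 0.0
        for x in itertools.product((0, 1), repeat=n):
            qx = np.prod([marginals[i] if xi else 1.0 - marginals[i]
                          for i, xi in enumerate(x)])
            tv += abs(p_table.get(x, 0.0) - qx)
        return tv / 2.0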


Unified lower bounds for interactive high-dimensional estimation under information constraints

arXiv.org Artificial Intelligence

We consider the problem of parameter estimation under local information constraints, where the estimation algorithm has access to only limited information about each sample. These constraints can be of various types, including communication constraints, where each sample must be described using a few (e.g., constant number of) bits; (local) privacy constraints, where each sample is obtained from a different user and the users seek to reveal as little as possible about their specific data; as well as many others, e.g., noisy communication channels, or limited types of data access such as linear measurements. Such problems have received a lot of attention in recent years, motivated by applications such as data analytics in distributed systems and federated learning. Our main focus is on information-theoretic lower bounds for the minimax error rates (or, equivalently, the sample complexity) of these problems. Several recent works have provided different bounds that apply to specific constraints or work for specific parametric estimation problems, sometimes without allowing for interactive protocols. Indeed, handling interactive protocols is technically challenging, and several results in prior work exhibit flaws in their analysis. In particular, even the most basic Gaussian mean estimation problem using interactive communication remains, quite surprisingly, open. We present general, "plug-and-play" lower bounds for parametric estimation under information constraints that can be used for any local information constraint and allow for interactive protocols. Our abstract bound requires very simple (and natural) assumptions to hold for the underlying parametric family; in particular, we do not require technical "regularity" conditions that are common in asymptotic statistics.
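
To make the "few bits per sample" constraint concrete, here is a toy, noninteractive one-bit protocol for one-dimensional Gaussian mean estimation; it is a textbook-style illustration of the setting, not a protocol or bound from the paper, and the threshold choice is arbitrary:

    import numpy as np
    from scipy.stats import norm

    def one_bit_gaussian_mean(samples, threshold=0.0):
        """Each user sends the single bit 1[X_i > threshold]; the server inverts the Gaussian CDF."""
        bits = (np.asarray(samples) > threshold).astype(float)
        p_hat = np.clip(bits.mean(), 1e-6, 1 - 1e-6)
        # If X ~ N(mu, 1), then P(X > t) = 1 - Phi(t - mu), hence mu = t - Phi^{-1}(1 - p).
        return threshold - norm.ppf(1.0 - p_hat)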


Robust Testing in High-Dimensional Sparse Models

arXiv.org Artificial Intelligence

We consider the problem of robustly testing the norm of a high-dimensional sparse signal vector under two different observation models. In the first model, we are given $n$ i.i.d. samples from the distribution $\mathcal{N}\left(\theta,I_d\right)$ (with unknown $\theta$), of which a small fraction has been arbitrarily corrupted. Under the promise that $\|\theta\|_0\le s$, we want to correctly distinguish whether $\|\theta\|_2=0$ or $\|\theta\|_2>\gamma$, for some input parameter $\gamma>0$. We show that any algorithm for this task requires $n=\Omega\left(s\log\frac{ed}{s}\right)$ samples, which is tight up to logarithmic factors. We also extend our results to other common notions of sparsity, namely, $\|\theta\|_q\le s$ for any $0 < q < 2$. In the second observation model that we consider, the data is generated according to a sparse linear regression model, where the covariates are i.i.d. Gaussian and the regression coefficient (signal) is known to be $s$-sparse. Here too we assume that an $\epsilon$-fraction of the data is arbitrarily corrupted. We show that any algorithm that reliably tests the norm of the regression coefficient requires at least $n=\Omega\left(\min(s\log d,{1}/{\gamma^4})\right)$ samples. Our results show that the complexity of testing in these two settings significantly increases under robustness constraints. This is in line with the recent observations made in robust mean testing and robust covariance testing.
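
Restating the first observation model in display form (directly from the abstract):
\[
X_1,\dots,X_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\theta, I_d)\ \text{with an $\epsilon$-fraction arbitrarily corrupted},\qquad
H_0\colon \|\theta\|_2 = 0 \quad\text{vs.}\quad H_1\colon \|\theta\|_2 > \gamma,\quad \text{under the promise } \|\theta\|_0 \le s,
\]
for which any reliable tester requires $n = \Omega\bigl(s \log \tfrac{ed}{s}\bigr)$ samples.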


Uniformity Testing in the Shuffle Model: Simpler, Better, Faster

arXiv.org Machine Learning

Learning from, or, more generally, performing statistical inference on sensitive or private data has become an increasingly important topic, where one must balance the desire to achieve good accuracy with the requirement to preserve privacy of the users' data. Among the many tasks concerned, hypothesis testing and, more specifically, goodness-of-fit testing is of particular importance, given its ubiquitous role in data analysis, the natural sciences, and more broadly as a workhorse of statistics and machine learning. In this paper, we consider the specific case of uniformity testing, the prototypical example of goodness-of-fit testing of discrete distributions, where one seeks to decide whether the data is drawn uniformly from a known finite domain. Investigating the trade-off between accuracy (or, equivalently, data requirements) and privacy for this task has received considerable attention over the past years in a variety of privacy models, including the central and local models of differential privacy, the so-called pan-privacy, and the recently proposed model of shuffle privacy. Unfortunately, while this trade-off is now well understood in most of the aforementioned privacy settings, some of the proposed algorithms remain relatively complex and far from practical, and their analysis quite involved. With this in mind, we focus in this paper on private uniformity testing in the shuffle model, both simplifying the analysis of the existing algorithms for this task and obtaining a new, arguably simpler one with the same guarantees.
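
As non-private background for the underlying task (uniformity testing itself, not the shuffle-model mechanisms studied in the paper), the classical collision-based tester compares the empirical collision probability with its value $1/k$ under the uniform distribution; the threshold below is illustrative rather than a tuned constant:

    import numpy as np

    def collision_uniformity_test(samples, domain_size, epsilon):
        """Accept 'uniform' iff the empirical collision probability is close to 1/k."""
        counts = np.bincount(samples, minlength=domain_size)
        n = len(samples)
        collisions = np.sum(counts * (counts - 1)) / (n * (n - 1))
        threshold = (1.0 + epsilon ** 2) / domain_size  # illustrative threshold
        return collisions <= threshold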


The Price of Tolerance in Distribution Testing

arXiv.org Machine Learning

Upon observing independent samples from an unknown probability distribution, can we determine whether it possesses some property of interest? This natural question, known as distribution testing or statistical hypothesis testing, has enjoyed significant study from several communities, including theoretical computer science, statistics, information theory, and machine learning.
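
For context, the "tolerance" of the title refers to the tolerant-testing formulation commonly used in this literature: rather than distinguishing membership in a property $\mathcal{P}$ from $\varepsilon$-farness, the tester must distinguish closeness from farness,
\[
d_{\mathrm{TV}}(p, \mathcal{P}) \le \varepsilon_1 \qquad\text{versus}\qquad d_{\mathrm{TV}}(p, \mathcal{P}) \ge \varepsilon_2, \qquad 0 \le \varepsilon_1 < \varepsilon_2,
\]
which is typically substantially harder than the non-tolerant case $\varepsilon_1 = 0$.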


Private Identity Testing for High-Dimensional Distributions

arXiv.org Machine Learning

In this work we present novel differentially private identity (goodness-of-fit) testers for natural and widely studied classes of multivariate product distributions: Gaussians in $\mathbb{R}^d$ with known covariance and product distributions over $\{\pm 1\}^{d}$. Our testers have improved sample complexity compared to those derived from previous techniques, and are the first testers whose sample complexity matches the order-optimal minimax sample complexity of $O(d^{1/2}/\alpha^2)$ in many parameter regimes. We construct two types of testers, exhibiting tradeoffs between sample complexity and computational complexity. Finally, we provide a two-way reduction between testing a subclass of multivariate product distributions and testing univariate distributions, and thereby obtain upper and lower bounds for testing this subclass of product distributions.
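
For reference, identity (goodness-of-fit) testing here has its usual meaning: given an explicitly known reference distribution $q$ from one of the classes above and samples from an unknown $p$, the tester must distinguish
\[
H_0\colon p = q \qquad\text{versus}\qquad H_1\colon d_{\mathrm{TV}}(p, q) \ge \alpha,
\]
now subject to differential privacy, with the order-optimal minimax sample complexity $O(d^{1/2}/\alpha^2)$ mentioned in the abstract.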