Collaborating Authors

 Canonne, Clément L.


Learning bounded-degree polytrees with known skeleton

arXiv.org Machine Learning

We establish finite-sample guarantees for efficient proper learning of bounded-degree polytrees, a rich class of high-dimensional probability distributions and a subclass of Bayesian networks, a widely-studied type of graphical model. Recently, Bhattacharyya et al. (2021) obtained finite-sample guarantees for recovering tree-structured Bayesian networks, i.e., 1-polytrees. We extend their results by providing an efficient algorithm which learns $d$-polytrees in polynomial time and sample complexity for any bounded $d$ when the underlying undirected graph (skeleton) is known. We complement our algorithm with an information-theoretic sample complexity lower bound, showing that the dependence on the dimension and target accuracy parameters is nearly tight.
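
As illustrative background only (not the paper's $d$-polytree algorithm), the tree-structured special case referenced above, 1-polytrees, is classically learned via the Chow-Liu procedure: estimate pairwise mutual information from samples and take a maximum-weight spanning tree. A minimal Python sketch, with hypothetical helper names:

    import numpy as np
    import networkx as nx

    def empirical_mutual_information(x, y):
        """Plug-in mutual information between two discrete sample columns."""
        n = len(x)
        joint = {}
        for a, b in zip(x, y):
            joint[(a, b)] = joint.get((a, b), 0) + 1
        px = {a: np.mean(x == a) for a in set(x)}
        py = {b: np.mean(y == b) for b in set(y)}
        return sum((c / n) * np.log((c / n) / (px[a] * py[b]))
                   for (a, b), c in joint.items())

    def chow_liu_tree(samples):
        """samples: (n, d) array of discrete observations; returns a max-weight spanning tree."""
        _, d = samples.shape
        g = nx.Graph()
        for i in range(d):
            for j in range(i + 1, d):
                w = empirical_mutual_information(samples[:, i], samples[:, j])
                g.add_edge(i, j, weight=w)
        return nx.maximum_spanning_tree(g)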


Private Distribution Learning with Public Data: The View from Sample Compression

arXiv.org Artificial Intelligence

We study the problem of private distribution learning with access to public data. In this setup, which we refer to as public-private learning, the learner is given public and private samples drawn from an unknown distribution $p$ belonging to a class $\mathcal Q$, with the goal of outputting an estimate of $p$ while adhering to privacy constraints (here, pure differential privacy) only with respect to the private samples. We show that the public-private learnability of a class $\mathcal Q$ is connected to the existence of a sample compression scheme for $\mathcal Q$, as well as to an intermediate notion we refer to as list learning. Leveraging this connection, we (1) approximately recover previous results on Gaussians over $\mathbb R^d$, and (2) obtain new ones, including sample complexity upper bounds for arbitrary $k$-mixtures of Gaussians over $\mathbb R^d$, results for agnostic and distribution-shift-resistant learners, and closure properties for public-private learnability under taking mixtures and products of distributions. Finally, via the connection to list learning, we show that for Gaussians in $\mathbb R^d$, at least $d$ public samples are necessary for private learnability, which is close to the known upper bound of $d+1$ public samples.
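
For reference, the privacy requirement here is standard pure differential privacy applied only to the private portion of the input: an algorithm $A$ taking public samples $S_{\mathrm{pub}}$ and private samples $S_{\mathrm{priv}}$ is $\varepsilon$-differentially private with respect to the private samples if, for every event $E$ and every pair $S_{\mathrm{priv}}, S'_{\mathrm{priv}}$ differing in a single sample,
\[
\Pr\bigl[A(S_{\mathrm{pub}}, S_{\mathrm{priv}}) \in E\bigr] \;\le\; e^{\varepsilon}\, \Pr\bigl[A(S_{\mathrm{pub}}, S'_{\mathrm{priv}}) \in E\bigr].
\]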


Concentration Bounds for Discrete Distribution Estimation in KL Divergence

arXiv.org Artificial Intelligence

Discrete distribution estimation, i.e., density estimation over discrete domains, is a fundamental problem in Statistics, with a rich history (see, e.g., [9, 10] for an overview and further references). In this work, we address a simple yet surprisingly ill-understood aspect of this question: what is the sample complexity of estimating an arbitrary discrete distribution in Kullback-Leibler (KL) divergence with vanishing probability of error? To describe the problem further, a few definitions are in order.
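
As a concrete, hedged illustration of the task, a standard baseline in this setting is the add-constant (Laplace-smoothed) estimator, which guarantees the estimate has full support so that the KL divergence from the true distribution is finite; whether this particular estimator attains the guarantees studied in the paper is not claimed here. A minimal Python sketch:

    import numpy as np

    def add_constant_estimate(samples, domain_size, c=1.0):
        """Return the smoothed empirical distribution: (count_i + c) / (n + c * k)."""
        counts = np.bincount(samples, minlength=domain_size).astype(float)
        return (counts + c) / (len(samples) + c * domain_size)

    def kl_divergence(p, q):
        """KL(p || q) over a finite domain; q must be positive wherever p is."""
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))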


Near-Optimal Degree Testing for Bayes Nets

arXiv.org Artificial Intelligence

This paper considers the problem of testing the maximum in-degree of the Bayes net underlying an unknown probability distribution $P$ over $\{0,1\}^n$, given sample access to $P$. We show that the sample complexity of the problem is $\tilde{\Theta}(2^{n/2}/\varepsilon^2)$. Our algorithm relies on a testing-by-learning framework, previously used to obtain sample-optimal testers; in order to apply this framework, we develop new algorithms for ``near-proper'' learning of Bayes nets, and high-probability learning under $\chi^2$ divergence, which are of independent interest.
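
The testing-by-learning framework mentioned above can be sketched, at a purely schematic level, as a two-stage reduction; the helpers below (`learn_hypothesis`, `tolerant_identity_test`) are hypothetical placeholders for a near-proper learner and a tolerant identity tester, not the paper's actual subroutines:

    def test_by_learning(samples_learn, samples_test, epsilon,
                         learn_hypothesis, tolerant_identity_test):
        # Stage 1: run a (near-)proper learner on one batch of samples to get a
        # candidate hypothesis from the class (here, bounded in-degree Bayes nets).
        q_hat = learn_hypothesis(samples_learn, accuracy=epsilon / 4)
        # Stage 2: test the unknown distribution against the fixed candidate.
        # If P lies in the class, q_hat should be close to P and the test accepts;
        # if P is far from the class, it is in particular far from q_hat.
        return tolerant_identity_test(samples_test, q_hat,
                                      near=epsilon / 2, far=epsilon)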


Independence Testing for Bounded Degree Bayesian Network

arXiv.org Artificial Intelligence

We study the following independence testing problem: given access to samples from a distribution $P$ over $\{0,1\}^n$, decide whether $P$ is a product distribution or whether it is $\varepsilon$-far in total variation distance from any product distribution. For arbitrary distributions, this problem requires $\exp(n)$ samples. We show in this work that if $P$ has a sparse structure, then in fact only linearly many samples are required. Specifically, if $P$ is Markov with respect to a Bayesian network whose underlying DAG has in-degree bounded by $d$, then $\tilde{\Theta}(2^{d/2}\cdot n/\varepsilon^2)$ samples are necessary and sufficient for independence testing.
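
For intuition only, the quantity being tested can be made concrete on tiny domains by brute force. The sketch below computes the total variation distance between a distribution $P$ on $\{0,1\}^n$ (given as a probability table) and the product of its own marginals; this is an illustrative proxy for distance from the set of product distributions, not the paper's tester, and the helper name is hypothetical:

    import itertools
    import numpy as np

    def tv_to_product_of_marginals(p_table, n):
        """p_table: dict mapping length-n 0/1 tuples to probabilities."""
        marginals = np.zeros(n)  # marginals[i] = P(X_i = 1)
        for x, px in p_table.items():
            for i, xi in enumerate(x):
                marginals[i] += px * xi
        tv = 0.0
        for x in itertools.product((0, 1), repeat=n):
            qx = np.prod([marginals[i] if xi else 1.0 - marginals[i]
                          for i, xi in enumerate(x)])
            tv += abs(p_table.get(x, 0.0) - qx)
        return tv / 2.0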


Unified lower bounds for interactive high-dimensional estimation under information constraints

arXiv.org Artificial Intelligence

We consider the problem of parameter estimation under local information constraints, where the estimation algorithm has access to only limited information about each sample. These constraints can be of various types, including communication constraints, where each sample must be described using a few (e.g., constant number of) bits; (local) privacy constraints, where each sample is obtained from a different user and the users seek to reveal as little as possible about their specific data; as well as many others, e.g., noisy communication channels, or limited types of data access such as linear measurements. Such problems have received a lot of attention in recent years, motivated by applications such as data analytics in distributed systems and federated learning. Our main focus is on information-theoretic lower bounds for the minimax error rates (or, equivalently, the sample complexity) of these problems. Several recent works have provided different bounds that apply to specific constraints or work for specific parametric estimation problems, sometimes without allowing for interactive protocols. Indeed, handling interactive protocols is technically challenging, and several results in prior work exhibit flaws in their analysis. In particular, even the most basic Gaussian mean estimation problem using interactive communication remains, quite surprisingly, open. We present general, "plug-and-play" lower bounds for parametric estimation under information constraints that can be used for any local information constraint and allow for interactive protocols. Our abstract bound requires very simple (and natural) assumptions to hold for the underlying parametric family; in particular, we do not require technical "regularity" conditions that are common in asymptotic statistics.
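
To make the "few bits per sample" constraint concrete, here is a toy, noninteractive one-bit protocol for one-dimensional Gaussian mean estimation; it is a textbook-style illustration of the setting, not a protocol or bound from the paper, and the threshold choice is arbitrary:

    import numpy as np
    from scipy.stats import norm

    def one_bit_gaussian_mean(samples, threshold=0.0):
        """Each user sends the single bit 1[X_i > threshold]; the server inverts the Gaussian CDF."""
        bits = (np.asarray(samples) > threshold).astype(float)
        p_hat = np.clip(bits.mean(), 1e-6, 1 - 1e-6)
        # If X ~ N(mu, 1), then P(X > t) = 1 - Phi(t - mu), hence mu = t - Phi^{-1}(1 - p).
        return threshold - norm.ppf(1.0 - p_hat)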


Robust Testing in High-Dimensional Sparse Models

arXiv.org Artificial Intelligence

We consider the problem of robustly testing the norm of a high-dimensional sparse signal vector under two different observation models. In the first model, we are given $n$ i.i.d. samples from the distribution $\mathcal{N}\left(\theta,I_d\right)$ (with unknown $\theta$), of which a small fraction has been arbitrarily corrupted. Under the promise that $\|\theta\|_0\le s$, we want to correctly distinguish whether $\|\theta\|_2=0$ or $\|\theta\|_2>\gamma$, for some input parameter $\gamma>0$. We show that any algorithm for this task requires $n=\Omega\left(s\log\frac{ed}{s}\right)$ samples, which is tight up to logarithmic factors. We also extend our results to other common notions of sparsity, namely, $\|\theta\|_q\le s$ for any $0 < q < 2$. In the second observation model that we consider, the data is generated according to a sparse linear regression model, where the covariates are i.i.d. Gaussian and the regression coefficient (signal) is known to be $s$-sparse. Here too we assume that an $\epsilon$-fraction of the data is arbitrarily corrupted. We show that any algorithm that reliably tests the norm of the regression coefficient requires at least $n=\Omega\left(\min(s\log d,{1}/{\gamma^4})\right)$ samples. Our results show that the complexity of testing in these two settings significantly increases under robustness constraints. This is in line with the recent observations made in robust mean testing and robust covariance testing.
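
Restating the first observation model in display form (directly from the abstract):
\[
X_1,\dots,X_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\theta, I_d)\ \text{with an $\epsilon$-fraction arbitrarily corrupted},\qquad
H_0\colon \|\theta\|_2 = 0 \quad\text{vs.}\quad H_1\colon \|\theta\|_2 > \gamma,\quad \text{under the promise } \|\theta\|_0 \le s,
\]
for which any reliable tester requires $n = \Omega\bigl(s \log \tfrac{ed}{s}\bigr)$ samples.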


Uniformity Testing in the Shuffle Model: Simpler, Better, Faster

arXiv.org Machine Learning

Learning from, or, more generally, performing statistical inference on sensitive or private data has become an increasingly important topic, where one must balance the desire to achieve good accuracy with the requirement to preserve privacy of the users' data. Among the many tasks concerned, hypothesis testing and, more specifically, goodness-of-fit testing is of particular importance, given its ubiquitous role in data analysis, the natural sciences, and more broadly as a workhorse of statistics and machine learning. In this paper, we consider the specific case of uniformity testing, the prototypical example of goodness-of-fit testing of discrete distributions, where one seeks to decide whether the data is drawn uniformly from a known finite domain. Investigating the trade-off between accuracy (or, equivalently, data requirements) and privacy for this task has received considerable attention over the past years in a variety of privacy models, including the central and local models of differential privacy, the so-called pan-privacy, and the recently proposed model of shuffle privacy. Unfortunately, while this trade-off is now well understood in most of the aforementioned privacy settings, some of the proposed algorithms remain relatively complex and far from practical, and their analysis quite involved. With this in mind, we focus in this paper on private uniformity testing in the shuffle model, both simplifying the analysis of the existing algorithms for this task and obtaining a new, arguably simpler one with the same guarantees.
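
As non-private background for the underlying task (uniformity testing itself, not the shuffle-model mechanisms studied in the paper), the classical collision-based tester compares the empirical collision probability with its value $1/k$ under the uniform distribution; the threshold below is illustrative rather than a tuned constant:

    import numpy as np

    def collision_uniformity_test(samples, domain_size, epsilon):
        """Accept 'uniform' iff the empirical collision probability is close to 1/k."""
        counts = np.bincount(samples, minlength=domain_size)
        n = len(samples)
        collisions = np.sum(counts * (counts - 1)) / (n * (n - 1))
        threshold = (1.0 + epsilon ** 2) / domain_size  # illustrative threshold
        return collisions <= threshold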


The Price of Tolerance in Distribution Testing

arXiv.org Machine Learning

Upon observing independent samples from an unknown probability distribution, can we determine whether it possesses some property of interest? This natural question, known as distribution testing or statistical hypothesis testing, has enjoyed significant study from several communities, including theoretical computer science, statistics, information theory, and machine learning.
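
For context, the "tolerance" of the title refers to the tolerant-testing formulation commonly used in this literature: rather than distinguishing membership in a property $\mathcal{P}$ from $\varepsilon$-farness, the tester must distinguish closeness from farness,
\[
d_{\mathrm{TV}}(p, \mathcal{P}) \le \varepsilon_1 \qquad\text{versus}\qquad d_{\mathrm{TV}}(p, \mathcal{P}) \ge \varepsilon_2, \qquad 0 \le \varepsilon_1 < \varepsilon_2,
\]
which is typically substantially harder than the non-tolerant case $\varepsilon_1 = 0$.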


Private Identity Testing for High-Dimensional Distributions

arXiv.org Machine Learning

In this work we present novel differentially private identity (goodness-of-fit) testers for natural and widely studied classes of multivariate product distributions: Gaussians in $\mathbb{R}^d$ with known covariance and product distributions over $\{\pm 1\}^{d}$. Our testers have improved sample complexity compared to those derived from previous techniques, and are the first testers whose sample complexity matches the order-optimal minimax sample complexity of $O(d^{1/2}/\alpha^2)$ in many parameter regimes. We construct two types of testers, exhibiting tradeoffs between sample complexity and computational complexity. Finally, we provide a two-way reduction between testing a subclass of multivariate product distributions and testing univariate distributions, and thereby obtain upper and lower bounds for testing this subclass of product distributions.
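
For reference, identity (goodness-of-fit) testing here has its usual meaning: given an explicitly known reference distribution $q$ from one of the classes above and samples from an unknown $p$, the tester must distinguish
\[
H_0\colon p = q \qquad\text{versus}\qquad H_1\colon d_{\mathrm{TV}}(p, q) \ge \alpha,
\]
now subject to differential privacy, with the order-optimal minimax sample complexity $O(d^{1/2}/\alpha^2)$ mentioned in the abstract.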