Goto

Collaborating Authors

 Country


Locally Private Hypothesis Selection

arXiv.org Machine Learning

We initiate the study of hypothesis selection under local differential privacy. Given samples from an unknown probability distribution $p$ and a set of $k$ probability distributions $\mathcal{Q}$, we aim to output, under the constraints of $\varepsilon$-local differential privacy, a distribution from $\mathcal{Q}$ whose total variation distance to $p$ is comparable to the best such distribution. This is a generalization of the classic problem of $k$-wise simple hypothesis testing, which corresponds to when $p \in \mathcal{Q}$, and we wish to identify $p$. Absent privacy constraints, this problem requires $O(\log k)$ samples from $p$, and it was recently shown that the same complexity is achievable under (central) differential privacy. However, the naive approach to this problem under local differential privacy would require $\tilde O(k^2)$ samples. We first show that the constraint of local differential privacy incurs an exponential increase in cost: any algorithm for this problem requires at least $\Omega(k)$ samples. Second, for the special case of $k$-wise simple hypothesis testing, we provide a non-interactive algorithm which nearly matches this bound, requiring $\tilde O(k)$ samples. Finally, we provide sequentially interactive algorithms for the general case, requiring $\tilde O(k)$ samples and only $O(\log \log k)$ rounds of interactivity. Our algorithms are achieved through a reduction to maximum selection with adversarial comparators, a problem of independent interest for which we initiate study in the parallel setting. For this problem, we provide a family of algorithms for each number of allowed rounds of interaction $t$, as well as lower bounds showing that they are near-optimal for every $t$. Notably, our algorithms result in exponential improvements on the round complexity of previous methods.


Private Mean Estimation of Heavy-Tailed Distributions

arXiv.org Machine Learning

Given samples X 1,...,X n from a distribution D, can we estimate the mean of D? This is the problem of mean estimation which is, alongside hypothesis testing, one of the most fundamental questions in statistics. As a result, answers to this problem are known in fairly general settings. For instance, the empirical mean is known to be an optimal estimate of a distribution's true mean under minimal assumptions. That said, statistics like the empirical mean put aside any concerns related to the sensitivity, and might vary significantly based on the addition of a single datapoint in the dataset. While this is not an inherently negative feature, it becomes a problem when the dataset contains personal information, and large shifts based on a single datapoint could potentially violate the corresponding individual's privacy. In order to assuage these concerns, we consider the problem of mean estimation under the constraint of differential privacy (DP) [DMNS06], considered by many to be the gold standard of data privacy. Informally, an algorithm is said to be differentially private if its distribution over outputs is insensitive to the addition or removal of a single datapoint from the dataset. Differential privacy has enjoyed widespread adoption, including deployment in by Apple [Dif17], Google [EPK14], Microsoft [DKY17], and the US Census Bureau for the 2020 Census [DLS 17]. In this vein, a recent line of work [KV18, KLSU19, BKSW19] gives nearly optimal differentially private algorithms for mean estimation of sub-Gaussian random variables.


Privately Learning Markov Random Fields

arXiv.org Machine Learning

We consider the problem of learning Markov Random Fields (including the prototypical example, the Ising model) under the constraint of differential privacy. Our learning goals include both structure learning, where we try to estimate the underlying graph structure of the model, as well as the harder goal of parameter learning, in which we additionally estimate the parameter on each edge. We provide algorithms and lower bounds for both problems under a variety of privacy constraints -- namely pure, concentrated, and approximate differential privacy. While non-privately, both learning goals enjoy roughly the same complexity, we show that this is not the case under differential privacy. In particular, only structure learning under approximate differential privacy maintains the non-private logarithmic dependence on the dimensionality of the data, while a change in either the learning goal or the privacy notion would necessitate a polynomial dependence. As a result, we show that the privacy constraint imposes a strong separation between these two learning problems in the high-dimensional data regime.


Calibrating Deep Neural Networks using Focal Loss

arXiv.org Machine Learning

Miscalibration -- a mismatch between a model's confidence and its correctness -- of Deep Neural Networks (DNNs) makes their predictions hard to rely on. Ideally, we want networks to be accurate, calibrated and confident. We show that, as opposed to the standard cross-entropy loss, focal loss (Lin et al., 2017) allows us to learn models that are already very well calibrated. When combined with temperature scaling, whilst preserving accuracy, it yields state-of-the-art calibrated models. We provide a thorough analysis of the factors causing miscalibration, and use the insights we glean from this to justify the empirically excellent performance of focal loss. To facilitate the use of focal loss in practice, we also provide a principled approach to automatically select the hyperparameter involved in the loss function. We perform extensive experiments on a variety of computer vision and NLP datasets, and with a wide variety of network architectures, and show that our approach achieves state-of-the-art accuracy and calibration in almost all cases.


Robustness from Simple Classifiers

arXiv.org Machine Learning

Despite the vast success of Deep Neural Networks in numerous application domains, it has been shown that such models are not robust i.e., they are vulnerable to small adversarial perturbations of the input. While extensive work has been done on why such perturbations occur or how to successfully defend against them, we still do not have a complete understanding of robustness. In this work, we investigate the connection between robustness and simplicity. We find that simpler classifiers, formed by reducing the number of output classes, are less susceptible to adversarial perturbations. Consequently, we demonstrate that decomposing a complex multiclass model into an aggregation of binary models enhances robustness. This behavior is consistent across different datasets and model architectures and can be combined with known defense techniques such as adversarial training. Moreover, we provide further evidence of a disconnect between standard and robust learning regimes. In particular, we show that elaborate label information can help standard accuracy but harm robustness.


A Multiclass Classification Approach to Label Ranking

arXiv.org Machine Learning

In multiclass classification, the goal is to learn how to predict a random label $Y$, valued in $\mathcal{Y}=\{1,\; \ldots,\; K \}$ with $K\geq 3$, based upon observing a r.v. $X$, taking its values in $\mathbb{R}^q$ with $q\geq 1$ say, by means of a classification rule $g:\mathbb{R}^q\to \mathcal{Y}$ with minimum probability of error $\mathbb{P}\{Y\neq g(X) \}$. However, in a wide variety of situations, the task targeted may be more ambitious, consisting in sorting all the possible label values $y$ that may be assigned to $X$ by decreasing order of the posterior probability $\eta_y(X)=\mathbb{P}\{Y=y \mid X \}$. This article is devoted to the analysis of this statistical learning problem, halfway between multiclass classification and posterior probability estimation (regression) and referred to as label ranking here. We highlight the fact that it can be viewed as a specific variant of ranking median regression (RMR), where, rather than observing a random permutation $\Sigma$ assigned to the input vector $X$ and drawn from a Bradley-Terry-Luce-Plackett model with conditional preference vector $(\eta_1(X),\; \ldots,\; \eta_K(X))$, the sole information available for training a label ranking rule is the label $Y$ ranked on top, namely $\Sigma^{-1}(1)$. Inspired by recent results in RMR, we prove that under appropriate noise conditions, the One-Versus-One (OVO) approach to multiclassification yields, as a by-product, an optimal ranking of the labels with overwhelming probability. Beyond theoretical guarantees, the relevance of the approach to label ranking promoted in this article is supported by experimental results.


Accessing Higher-level Representations in Sequential Transformers with Feedback Memory

arXiv.org Machine Learning

Transformers are feedforward networks that can process input tokens in parallel. While this parallelization makes them computationally efficient, it restricts the model from fully exploiting the sequential nature of the input - the representation at a given layer can only access representations from lower layers, rather than the higher level representations already built in previous time steps. In this work, we propose the Feedback Transformer architecture that exposes all previous representations to all future representations, meaning the lowest representation of the current timestep is formed from the highest-level abstract representation of the past. We demonstrate on a variety of benchmarks in language modeling, neural machine translation, summarization, and reinforcement learning that the increased representation capacity can improve over Transformer baselines.


It's Not What Machines Can Learn, It's What We Cannot Teach

arXiv.org Machine Learning

Can deep neural networks learn to solve any task, and in particular problems of high complexity? This question attracts a lot of interest, with recent works tackling computationally hard tasks such as the traveling salesman problem and satisfiability. In this work we offer a different perspective on this question. Given the common assumption that $\textit{NP} \neq \textit{coNP}$ we prove that any polynomial-time sample generator for an $\textit{NP}$-hard problem samples, in fact, from an easier sub-problem. We empirically explore a case study, Conjunctive Query Containment, and show how common data generation techniques generate biased datasets that lead practitioners to over-estimate model accuracy. Our results suggest that machine learning approaches that require training on a dense uniform sampling from the target distribution cannot be used to solve computationally hard problems, the reason being the difficulty of generating sufficiently large and unbiased training sets.


Adversarial Detection and Correction by Matching Prediction Distributions

arXiv.org Machine Learning

We present a novel adversarial detection and correction method for machine learning classifiers.The detector consists of an autoencoder trained with a custom loss function based on the Kullback-Leibler divergence between the classifier predictions on the original and reconstructed instances.The method is unsupervised, easy to train and does not require any knowledge about the underlying attack. The detector almost completely neutralises powerful attacks like Carlini-Wagner or SLIDE on MNIST and Fashion-MNIST, and remains very effective on CIFAR-10 when the attack is granted full access to the classification model but not the defence. We show that our method is still able to detect the adversarial examples in the case of a white-box attack where the attacker has full knowledge of both the model and the defence and investigate the robustness of the attack. The method is very flexible and can also be used to detect common data corruptions and perturbations which negatively impact the model performance. We illustrate this capability on the CIFAR-10-C dataset.


Estimation of conditional mixture Weibull distribution with right-censored data using neural network for time-to-event analysis

arXiv.org Machine Learning

In this paper, we consider survival analysis with right-censored data which is a common situation in predictive maintenance and health field. We propose a model based on the estimation of two-parameter Weibull distribution conditionally to the features. To achieve this result, we describe a neural network architecture and the associated loss functions that takes into account the right-censored data. We extend the approach to a finite mixture of two-parameter Weibull distributions. We first validate that our model is able to precisely estimate the right parameters of the conditional Weibull distribution on synthetic datasets. In numerical experiments on two real-word datasets (METABRIC and SEER), our model outperforms the state-of-the-art methods. We also demonstrate that our approach can consider any survival time horizon.