Collaborating Authors

 Mahmoody, Mohammad


Publicly Detectable Watermarking for Language Models

arXiv.org Artificial Intelligence

We construct the first provable watermarking scheme for language models with public detectability or verifiability: we use a private key for watermarking and a public key for watermark detection. Our protocol is the first watermarking scheme that does not embed a statistical signal in generated text. Rather, we directly embed a publicly-verifiable cryptographic signature using a form of rejection sampling. We show that our construction meets strong formal security guarantees and preserves many desirable properties found in schemes in the private-key watermarking setting. In particular, our watermarking scheme retains distortion-freeness and model agnosticity. We implement our scheme and make empirical measurements over open models in the 7B parameter range. Our experiments suggest that our watermarking scheme meets our formal claims while preserving text quality.
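
The abstract describes the mechanism only at a high level; the following is a minimal Python sketch of the general idea of embedding a public-key signature through rejection sampling over model outputs. It is not the paper's protocol: `lm_generate` (a next-token sampler) and `bit_of` (a token-to-bit hash) are hypothetical stand-ins, and Ed25519 is used only as a convenient signature scheme.

```python
# Minimal sketch: embed a signature in generated text via rejection sampling,
# then verify it using only the public key. Not the paper's exact protocol.
import hashlib
from cryptography.hazmat.primitives.asymmetric import ed25519

def bit_of(token: str) -> int:
    # Hypothetical token-to-bit mapping via a hash.
    return hashlib.sha256(token.encode()).digest()[0] & 1

def embed(lm_generate, prompt: str, sk: ed25519.Ed25519PrivateKey, max_tries: int = 64):
    # Sign the prompt (a real scheme would bind the signature to more context).
    sig_bits = [(byte >> j) & 1 for byte in sk.sign(prompt.encode()) for j in range(8)]
    tokens = []
    for target in sig_bits:
        for _ in range(max_tries):  # rejection sampling: resample until the bit matches
            tok = lm_generate(prompt, tokens)
            if bit_of(tok) == target:
                break
        tokens.append(tok)
    return tokens

def detect(tokens, prompt: str, pk: ed25519.Ed25519PublicKey) -> bool:
    bits = [bit_of(t) for t in tokens]
    sig = bytes(sum(bits[8 * i + j] << j for j in range(8)) for i in range(len(bits) // 8))
    try:
        pk.verify(sig, prompt.encode())  # detection needs only the public key
        return True
    except Exception:
        return False
```

Because detection is just signature verification, anyone holding the public key can check the watermark, while only the private-key holder can embed it.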


On Optimal Learning Under Targeted Data Poisoning

arXiv.org Artificial Intelligence

Consider the task of learning a hypothesis class $\mathcal{H}$ in the presence of an adversary that can replace up to an $\eta$ fraction of the examples in the training set with arbitrary adversarial examples. The adversary aims to fail the learner on a particular target test point $x$ which is known to the adversary but not to the learner. In this work we aim to characterize the smallest achievable error $\epsilon=\epsilon(\eta)$ by the learner in the presence of such an adversary in both realizable and agnostic settings. We fully achieve this in the realizable setting, proving that $\epsilon=\Theta(\mathtt{VC}(\mathcal{H})\cdot \eta)$, where $\mathtt{VC}(\mathcal{H})$ is the VC dimension of $\mathcal{H}$. Remarkably, we show that the upper bound can be attained by a deterministic learner. In the agnostic setting we reveal a more elaborate landscape: we devise a deterministic learner with a multiplicative regret guarantee of $\epsilon \leq C\cdot\mathtt{OPT} + O(\mathtt{VC}(\mathcal{H})\cdot \eta)$, where $C > 1$ is a universal numerical constant. We complement this by showing that for any deterministic learner there is an attack which worsens its error to at least $2\cdot \mathtt{OPT}$. This implies that a multiplicative deterioration in the regret is unavoidable in this case. Finally, the algorithms we develop for achieving the optimal rates are inherently improper. Nevertheless, we show that for a variety of natural concept classes, such as linear classifiers, it is possible to retain the dependence $\epsilon=\Theta_{\mathcal{H}}(\eta)$ by a proper algorithm in the realizable setting. Here $\Theta_{\mathcal{H}}$ conceals a polynomial dependence on $\mathtt{VC}(\mathcal{H})$.
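
In display form (the halfspace instantiation below is ours, added for concreteness; halfspaces in $\mathbb{R}^d$ have VC dimension $d+1$):

```latex
% Realizable setting (tight; attainable by a deterministic learner):
\[
  \epsilon(\eta) \;=\; \Theta\!\big(\mathtt{VC}(\mathcal{H})\cdot \eta\big),
  \qquad \text{e.g. } \Theta\!\big((d+1)\,\eta\big) \text{ for halfspaces in } \mathbb{R}^{d}.
\]
% Agnostic setting (deterministic learners):
\[
  \epsilon(\eta) \;\le\; C\cdot \mathtt{OPT} \;+\; O\!\big(\mathtt{VC}(\mathcal{H})\cdot \eta\big),
  \qquad
  \text{while some attack forces } \epsilon \;\ge\; 2\cdot\mathtt{OPT}.
\]
```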


Lower Bounds for Adversarially Robust PAC Learning

arXiv.org Machine Learning

In this work, we initiate a formal study of probably approximately correct (PAC) learning under evasion attacks, where the adversary's goal is to \emph{misclassify} the adversarially perturbed sample point $\widetilde{x}$, i.e., $h(\widetilde{x})\neq c(\widetilde{x})$, where $c$ is the ground truth concept and $h$ is the learned hypothesis. Previous work on PAC learning under adversarial examples has modeled adversarial examples as corrupted inputs, in which the goal of the adversary is to achieve $h(\widetilde{x}) \neq c(x)$, where $x$ is the original untampered instance. These two definitions of adversarial risk coincide for many natural distributions, such as images, but are incomparable in general. We first prove that for many theoretically natural input spaces of high dimension $n$ (e.g., isotropic Gaussian in dimension $n$ under $\ell_2$ perturbations), if the adversary is allowed to apply up to a sublinear $o(\|x\|)$ amount of perturbation to the test instances, PAC learning requires sample complexity that is exponential in $n$. This is in contrast with results proved using the corrupted-input framework, in which the sample complexity of robust learning is only polynomially larger. We then formalize hybrid attacks in which the evasion attack is preceded by a poisoning attack. This is perhaps reminiscent of "trapdoor attacks" in which a poisoning phase is involved as well, but the evasion phase here uses the error-region definition of risk that aims at misclassifying the perturbed instances. In this case, we show PAC learning is sometimes impossible altogether, even when it is possible without the attack (e.g., due to the bounded VC dimension).
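
The distinction between the two risk notions is easiest to see in symbols. Writing $B(x)$ for the set of allowed perturbations of $x$ (notation ours), the error-region risk studied here and the corrupted-input risk of prior work are:

```latex
% Error-region risk (this work): the perturbed point must itself be misclassified.
\[
  \mathrm{Risk}^{\mathrm{ER}}(h,c)
    = \Pr_{x \sim D}\Big[\exists\, \widetilde{x} \in B(x):\; h(\widetilde{x}) \neq c(\widetilde{x})\Big]
\]
% Corrupted-input risk (prior work): the perturbed point is judged against the
% original label c(x).
\[
  \mathrm{Risk}^{\mathrm{CI}}(h,c)
    = \Pr_{x \sim D}\Big[\exists\, \widetilde{x} \in B(x):\; h(\widetilde{x}) \neq c(x)\Big]
\]
% The two coincide whenever c is constant on B(x) (labels stable under small
% perturbations, as is typically assumed for images), but are incomparable in general.
```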


Empirically Measuring Concentration: Fundamental Limits on Intrinsic Robustness

arXiv.org Machine Learning

Many recent works have shown that adversarial examples that fool classifiers can be found by minimally perturbing a normal input. Recent theoretical results, starting with Gilmer et al. (2018), show that if the inputs are drawn from a concentrated metric probability space, then adversarial examples with small perturbation are inevitable. A concentrated space has the property that any subset with $\Omega(1)$ (e.g., 1/100) measure, according to the imposed distribution, has small distance to almost all (e.g., 99/100) of the points in the space. It is not clear, however, whether these theoretical results apply to actual distributions such as images. This paper presents a method, proven to converge to the actual concentration, for empirically measuring and bounding the concentration of a concrete dataset. We use it to empirically estimate the intrinsic robustness to $\ell_\infty$ and $\ell_2$ perturbations of several image classification benchmarks.
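
As a rough illustration of what such an empirical estimate can look like, here is a hedged Python sketch that searches over unions of $\ell_2$ balls for a low-measure subset whose $\epsilon$-expansion is small; the paper's actual collection of candidate subsets and its convergence analysis are more refined.

```python
# Hedged sketch of empirically estimating concentration on a finite sample by
# searching over unions of L2 balls (a simplification of the general approach).
import numpy as np

def emp_measure(points, centers, radius):
    # Empirical measure of the union of balls of the given radius around the centers.
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
    return np.mean(d.min(axis=1) <= radius)

def estimate_concentration(X, alpha=0.01, eps=0.5, n_candidates=200, k=3, seed=None):
    """Search over unions of k balls for a set of empirical measure >= alpha whose
    eps-expansion has the smallest empirical measure; return that measure."""
    rng = np.random.default_rng(seed)
    best = 1.0
    for _ in range(n_candidates):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1).min(axis=1)
        r = np.sort(dists)[int(np.ceil(alpha * len(X))) - 1]  # smallest radius covering alpha
        best = min(best, emp_measure(X, centers, r + eps))    # measure of the eps-expansion
    return best

# Example (isotropic Gaussian, which is highly concentrated):
# X = np.random.default_rng(0).normal(size=(2000, 50))
# print(estimate_concentration(X, alpha=0.01, eps=5.0))
```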


Adversarially Robust Learning Could Leverage Computational Hardness

arXiv.org Machine Learning

Over recent years, devising classification algorithms that are robust to adversarial perturbations has emerged as a challenging problem. In particular, deep neural nets (DNNs) seem to be susceptible to small imperceptible changes over test instances. In this work, we study whether there is any learning task for which it is possible to design classifiers that are only robust against polynomial-time adversaries. Indeed, numerous cryptographic tasks (e.g., encryption of long messages) are only secure against computationally bounded adversaries, and are indeed impossible for computationally unbounded attackers. Thus, it is natural to ask if the same strategy could help robust learning. We show that computational limitation of attackers can indeed be useful in robust learning by demonstrating a classifier for a learning task in which computational and information theoretic adversaries of bounded perturbations have very different power. Namely, while computationally unbounded adversaries can attack successfully and find adversarial examples with small perturbation, polynomial time adversaries are unable to do so unless they can break standard cryptographic hardness assumptions. Our results, therefore, indicate that perhaps a similar approach to cryptography (relying on computational hardness) holds promise for achieving computationally robust machine learning. We also show that the existence of such a learning task in which computational robustness beats information theoretic robustness implies (average-case) hard problems in $\mathbf{NP}$.
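
A toy Python illustration of the cryptographic ingredient only (not the paper's learning task or its analysis): if instances carry a signature over their content, a polynomial-time adversary who perturbs the content cannot produce a matching signature without breaking the scheme, so tampering is detectable; an unbounded adversary can recover the signing key from the public key and forge freely.

```python
# Toy illustration of the cryptographic ingredient, not the paper's construction:
# a signature binds the instance, so bounded adversaries cannot perturb it and
# keep the tag valid, whereas unbounded adversaries can forge.
from cryptography.hazmat.primitives.asymmetric import ed25519

sk = ed25519.Ed25519PrivateKey.generate()  # held only by the data generator
pk = sk.public_key()                       # known to the classifier

def make_instance(payload: bytes):
    return (payload, sk.sign(payload))     # honest instances come pre-signed

def classify(instance, base_predictor):
    payload, tag = instance
    try:
        pk.verify(tag, payload)            # raises if the payload was perturbed
    except Exception:
        return "tampering-detected"        # bounded adversaries end up here
    return base_predictor(payload)
```

Turning this detection ability into a formally robust learning task, so that misclassification rather than mere detection requires forging, is the harder step that the paper's construction addresses.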


Adversarial Risk and Robustness: General Definitions and Implications for the Uniform Distribution

Neural Information Processing Systems

We study adversarial perturbations when the instances are uniformly distributed over {0,1}^n. We study both "inherent" bounds that apply to any problem and any classifier for such a problem as well as bounds that apply to specific problems and specific hypothesis classes. As the current literature contains multiple definitions of adversarial risk and robustness, we start by giving a taxonomy for these definitions based on their direct goals; we identify one of them as the one guaranteeing misclassification by pushing the instances to the error region. We then study some classic algorithms for learning monotone conjunctions and compare their adversarial risk and robustness under different definitions by attacking the hypotheses using instances drawn from the uniform distribution. We observe that sometimes these definitions lead to significantly different bounds. Thus, this study advocates for the use of the error-region definition, even though other definitions, in other contexts with context-dependent assumptions, may coincide with the error-region definition. Using the error-region definition of adversarial perturbations, we then study inherent bounds on risk and robustness of any classifier for any classification problem whose instances are uniformly distributed over {0,1}^n. Using the isoperimetric inequality for the Boolean hypercube, we show that for initial error 0.01, there always exists an adversarial perturbation that changes O(√n) bits of the instances to increase the risk to 0.5, making the classifier's decisions meaningless. Furthermore, by also using the central limit theorem we show that when n → ∞, at most c√n bits of perturbations, for a universal constant c < 1.17, suffice for increasing the risk to 0.5, and the same c√n bits of perturbations on average suffice to increase the risk to 1, hence bounding the robustness by c√n.
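
A back-of-the-envelope for where the constant 1.17 comes from (a sketch consistent with the abstract's CLT argument, taking the extremal error region of measure 0.01 to be a Hamming half-space): the number of ones of a uniform point in {0,1}^n is approximately N(n/2, n/4), so one standard deviation corresponds to √n/2 bit flips, and moving the half-space's measure from 0.01 up to 0.5 takes about Φ^{-1}(0.99) standard deviations.

```latex
\[
  \underbrace{\Phi^{-1}(0.99)}_{\approx\, 2.33}
  \cdot
  \underbrace{\tfrac{\sqrt{n}}{2}}_{\text{one std.\ dev.\ in flipped bits}}
  \;\approx\; 1.17\,\sqrt{n}
  \qquad\Longrightarrow\qquad c < 1.17 .
\]
```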


Can Adversarially Robust Learning Leverage Computational Hardness?

arXiv.org Machine Learning

Making learners robust to adversarial perturbation at test time (i.e., evasion attacks) or training time (i.e., poisoning attacks) has emerged as a challenging task. It is known that for some natural settings, sublinear perturbations in the training phase or the testing phase can drastically decrease the quality of the predictions. These negative results, however, are information theoretic and only prove the existence of such successful adversarial perturbations. A natural question for these settings is whether or not we can make classifiers computationally robust to polynomial-time attacks. In this work, we prove strong barriers against achieving such envisioned computational robustness both for evasion and poisoning attacks. In particular, we show that if the test instances come from a product distribution (e.g., uniform over $\{0,1\}^n$ or $[0,1]^n$, or isotropic $n$-variate Gaussian) and there is an initial constant error, then there exists a polynomial-time attack that finds adversarial examples of Hamming distance $O(\sqrt{n})$. For poisoning attacks, we prove that for any learning algorithm with sample complexity $m$ and any efficiently computable "predicate" defining some "bad" property $B$ for the produced hypothesis (e.g., failing on a particular test) that happens with an initial constant probability, there exist polynomial-time online poisoning attacks that tamper with $O(\sqrt{m})$ many examples, replace them with other correctly labeled examples, and increase the probability of the bad event $B$ to $\approx 1$. Both of our poisoning and evasion attacks are black-box in how they access their corresponding components of the system (i.e., the hypothesis, the concept, and the learning algorithm) and make no further assumptions about the classifier or the learning algorithm producing the classifier.
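
Stated compactly (notation as in the abstract; $\mathrm{HD}$ denotes Hamming distance, and both statements assume the initial error or bad-event probability is a constant):

```latex
% Evasion: product test distribution, constant initial error. A polynomial-time
% adversary finds a misclassified \widetilde{x} with
\[
  \mathrm{HD}(x,\widetilde{x}) \;=\; O(\sqrt{n}).
\]
% Poisoning: sample complexity m, efficiently checkable bad predicate B with
% constant initial probability. A polynomial-time online attacker replacing only
% O(\sqrt{m}) correctly labeled examples achieves
\[
  \Pr[B] \;\approx\; 1 .
\]
```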


Adversarial Risk and Robustness: General Definitions and Implications for the Uniform Distribution

arXiv.org Machine Learning

We study adversarial perturbations when the instances are uniformly distributed over $\{0,1\}^n$. We study both "inherent" bounds that apply to any problem and any classifier for such a problem as well as bounds that apply to specific problems and specific hypothesis classes. As the current literature contains multiple definitions of adversarial risk and robustness, we start by giving a taxonomy for these definitions based on their goals; we identify one of them as the one guaranteeing misclassification by pushing the instances to the error region. We then study some classic algorithms for learning monotone conjunctions and compare their adversarial risk and robustness under different definitions by attacking the hypotheses using instances drawn from the uniform distribution. We observe that sometimes these definitions lead to significantly different bounds. Thus, this study advocates for the use of the error-region definition, even though other definitions, in other contexts, may coincide with the error-region definition. Using the error-region definition of adversarial perturbations, we then study inherent bounds on risk and robustness of any classifier for any classification problem whose instances are uniformly distributed over $\{0,1\}^n$. Using the isoperimetric inequality for the Boolean hypercube, we show that for initial error $0.01$, there always exists an adversarial perturbation that changes $O(\sqrt{n})$ bits of the instances to increase the risk to $0.5$, making the classifier's decisions meaningless. Furthermore, by also using the central limit theorem we show that when $n\to \infty$, at most $c \cdot \sqrt{n}$ bits of perturbations, for a universal constant $c< 1.17$, suffice for increasing the risk to $0.5$, and the same $c \cdot \sqrt{n}$ bits of perturbations on average suffice to increase the risk to $1$, hence bounding the robustness by $c \cdot \sqrt{n}$.


Multi-party Poisoning through Generalized $p$-Tampering

arXiv.org Machine Learning

In a poisoning attack against a learning algorithm, an adversary tampers with a fraction of the training data $T$ with the goal of increasing the classification error of the constructed hypothesis/model over the final test distribution. In the distributed setting, $T$ might be gathered gradually from $m$ data providers $P_1,\dots,P_m$ who generate and submit their shares of $T$ in an online way. In this work, we initiate a formal study of $(k,p)$-poisoning attacks in which an adversary controls $k\in[m]$ of the parties, and even for each corrupted party $P_i$, the adversary submits some poisoned data $T'_i$ on behalf of $P_i$ that is still "$(1-p)$-close" to the correct data $T_i$ (e.g., $1-p$ fraction of $T'_i$ is still honestly generated). For $k=m$, this model becomes the traditional notion of poisoning, and for $p=1$ it coincides with the standard notion of corruption in multi-party computation. We prove that if there is an initial constant error for the generated hypothesis $h$, there is always a $(k,p)$-poisoning attacker who can decrease the confidence of $h$ (to have a small error), or alternatively increase the error of $h$, by $\Omega(p \cdot k/m)$. Our attacks can be implemented in polynomial time given samples from the correct data, and they use no wrong labels if the original distributions are not noisy. At a technical level, we prove a general lemma about biasing bounded functions $f(x_1,\dots,x_n)\in[0,1]$ through an attack model in which each block $x_i$ might be controlled by an adversary with marginal probability $p$ in an online way. When the probabilities are independent, this coincides with the model of $p$-tampering attacks; thus, we call our model generalized $p$-tampering. We prove the power of such attacks by incorporating ideas from the context of coin-flipping attacks into the $p$-tampering model and generalizing the results in both of these areas.
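
To make the biasing lemma concrete, here is a hedged Python sketch of a generalized $p$-tampering adversary that biases a bounded function upward by greedily choosing, on each block it happens to control, the candidate value with the highest Monte Carlo estimate of the conditional expectation; it illustrates the attack model only, not the paper's algorithm or its quantitative guarantee.

```python
# Hedged sketch of generalized p-tampering: each block is adversarially chosen
# with marginal probability p; the adversary greedily maximizes a Monte Carlo
# estimate of E[f | prefix]. Illustrates the model, not the paper's algorithm.
import random

def cond_expectation(f, prefix, sample_block, n, trials=200):
    # Estimate E[f(x_1..x_n)] given that the first len(prefix) blocks equal `prefix`.
    return sum(f(prefix + [sample_block() for _ in range(n - len(prefix))])
               for _ in range(trials)) / trials

def p_tamper(f, sample_block, n, p, candidates=8):
    x = []
    for _ in range(n):
        if random.random() < p:  # this block is controlled by the adversary
            pool = [sample_block() for _ in range(candidates)]
            x.append(max(pool, key=lambda b: cond_expectation(f, x + [b], sample_block, n)))
        else:                    # honest block, sampled from the true distribution
            x.append(sample_block())
    return x

# Example: biasing the fraction of ones among n fair bits above 1/2.
# n = 30
# f = lambda bits: sum(bits) / len(bits)
# print(sum(p_tamper(f, lambda: random.randint(0, 1), n, p=0.3)) / n)
```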