Is Certifying $\ell_p$ Robustness Still Worthwhile?
Mangal, Ravi, Leino, Klas, Wang, Zifan, Hu, Kai, Yu, Weicheng, Pasareanu, Corina, Datta, Anupam, Fredrikson, Matt
Over the years, researchers have developed myriad attacks that exploit the ubiquity of adversarial examples, as well as defenses that aim to guard against the security vulnerabilities posed by such attacks. Of particular interest to this paper are defenses that provide provable guarantees against the class of $\ell_p$-bounded attacks. Certified defenses have made significant progress, taking robustness certification from toy models and datasets to large-scale problems like ImageNet classification. While this is undoubtedly an interesting academic problem, as the field has matured, its impact in practice remains unclear; thus, we find it useful to revisit the motivation for continuing this line of research. There are three layers to this inquiry, which we address in this paper: (1) why do we care about robustness research? (2) why do we care about the $\ell_p$-bounded threat model? and (3) why do we care about certification as opposed to empirical defenses? In brief, we take the position that local robustness certification indeed confers practical value to the field of machine learning. We focus especially on the latter two questions from above. With respect to the first of the two, we argue that the $\ell_p$-bounded threat model acts as a minimal requirement for safe application of models in security-critical domains, while at the same time, evidence has mounted suggesting that local robustness may lead to downstream external benefits not immediately related to robustness. As for the second, we argue that (i) certification provides a resolution to the cat-and-mouse game of adversarial attacks; and furthermore, that (ii) perhaps contrary to popular belief, there may not exist a fundamental trade-off between accuracy, robustness, and certifiability, and moreover, that certified training techniques constitute a particularly promising way to learn robust models.
Representation Engineering: A Top-Down Approach to AI Transparency
Zou, Andy, Phan, Long, Chen, Sarah, Campbell, James, Guo, Phillip, Ren, Richard, Pan, Alexander, Yin, Xuwang, Mazeika, Mantas, Dombrowski, Ann-Kathrin, Goel, Shashwat, Li, Nathaniel, Byun, Michael J., Wang, Zifan, Mallen, Alex, Basart, Steven, Koyejo, Sanmi, Song, Dawn, Fredrikson, Matt, Kolter, J. Zico, Hendrycks, Dan
In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of large language models. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including honesty, harmlessness, power-seeking, and more, demonstrating the promise of top-down transparency research. We hope that this work catalyzes further exploration of RepE and fosters advancements in the transparency and safety of AI systems.
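As a concrete illustration of population-level representation reading, the sketch below extracts a concept direction as a normalized difference of mean hidden activations and scores new activations against it. The array names, the difference-of-means heuristic, and the dot-product score are illustrative assumptions, not the paper's full RepE pipeline.

import numpy as np

def reading_vector(pos_states: np.ndarray, neg_states: np.ndarray) -> np.ndarray:
    # pos_states / neg_states: (n, d) hidden activations collected on prompts
    # that do / do not exhibit the target concept (e.g., honest vs. dishonest).
    direction = pos_states.mean(axis=0) - neg_states.mean(axis=0)
    return direction / np.linalg.norm(direction)

def concept_score(hidden_state: np.ndarray, direction: np.ndarray) -> float:
    # Monitoring: project a new activation onto the concept direction.
    return float(hidden_state @ direction)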
Learning Modulo Theories
Fredrikson, Matt, Lu, Kaiji, Vijayakumar, Saranya, Jha, Somesh, Ganesh, Vijay, Wang, Zifan
Recent techniques that integrate \emph{solver layers} into Deep Neural Networks (DNNs) have shown promise in bridging a long-standing gap between inductive learning and symbolic reasoning techniques. In this paper we present a set of techniques for integrating \emph{Satisfiability Modulo Theories} (SMT) solvers into the forward and backward passes of a deep network layer, called SMTLayer. Using this approach, one can encode rich domain knowledge into the network in the form of mathematical formulas. In the forward pass, the solver uses symbols produced by prior layers, along with these formulas, to construct inferences; in the backward pass, the solver informs updates to the network, driving it towards representations that are compatible with the solver's theory. Notably, the solver need not be differentiable. We implement SMTLayer as a PyTorch module, and our empirical results show that it leads to models that \emph{1)} require fewer training samples than conventional models, \emph{2)} are robust to certain types of covariate shift, and \emph{3)} ultimately learn representations that are consistent with symbolic knowledge, and are thus naturally interpretable.
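To illustrate how a solver can sit inside a network's forward and backward passes, the following toy PyTorch autograd.Function consults the z3 SMT solver to enforce an XOR theory over two boolean symbols. The XOR theory, the hard thresholding, and the straight-through-style backward rule are illustrative assumptions only, not the SMTLayer algorithm from the paper.

import torch
from z3 import Bool, Solver, Xor, is_true, sat

class XorSMTLayer(torch.autograd.Function):
    @staticmethod
    def forward(ctx, logits):
        # logits: (batch, 2) real-valued scores for two boolean symbols.
        bits = (logits > 0).to(torch.float32)               # discretize to {0, 1}
        outputs = torch.zeros(logits.shape[0], 1, device=logits.device)
        for i in range(logits.shape[0]):
            a, b, y = Bool("a"), Bool("b"), Bool("y")
            s = Solver()
            s.add(a == bool(bits[i, 0].item()))
            s.add(b == bool(bits[i, 1].item()))
            s.add(y == Xor(a, b))                           # the encoded domain knowledge
            assert s.check() == sat
            outputs[i, 0] = 1.0 if is_true(s.model()[y]) else 0.0
        ctx.num_symbols = logits.shape[1]
        return outputs

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through-style surrogate: route the output gradient back to
        # every input symbol (a heuristic, not the paper's solver-informed update).
        return grad_output.expand(-1, ctx.num_symbols).contiguous()

A call such as XorSMTLayer.apply(logits) then behaves as a layer whose output is constrained by the encoded formula, while gradients still flow to the layers that produce the symbols.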
Globally-Robust Neural Networks
Leino, Klas, Wang, Zifan, Fredrikson, Matt
The threat of adversarial examples has motivated work on training certifiably robust neural networks, to facilitate efficient verification of local robustness at inference time. We formalize a notion of global robustness, which captures the operational properties of on-line local robustness certification while yielding a natural learning objective for robust training. We show that widely-used architectures can be easily adapted to this objective by incorporating efficient global Lipschitz bounds into the network, yielding certifiably-robust models by construction that achieve state-of-the-art verifiable and clean accuracy. Notably, this approach requires significantly less time and memory than recent certifiable training methods, and leads to negligible costs when certifying points on-line; for example, our evaluation shows that it is possible to train a large tiny-imagenet model in a matter of hours. We posit that this is possible using inexpensive global bounds -- despite prior suggestions that tighter local bounds are needed for good performance -- because these models are trained to achieve tighter global bounds. Namely, we prove that the maximum achievable verifiable accuracy for a given dataset is not improved by using a local bound.
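For intuition, the sketch below shows the generic Lipschitz-margin certificate underlying this style of certification: bound the network's global $\ell_2$ Lipschitz constant by the product of per-layer spectral norms, and certify a point whenever its logit margin exceeds sqrt(2) times that bound times the radius. This is a standard, conservative check used as a stand-in here, not the paper's GloRo construction, which builds the bound into the architecture and training.

import math
import torch
import torch.nn as nn

def global_lipschitz_bound(model: nn.Sequential) -> float:
    # Product of spectral norms of the linear layers (ReLU is 1-Lipschitz).
    bound = 1.0
    for layer in model:
        if isinstance(layer, nn.Linear):
            bound *= torch.linalg.matrix_norm(layer.weight, ord=2).item()
    return bound

def certify(model: nn.Sequential, x: torch.Tensor, eps: float) -> torch.Tensor:
    # Returns a boolean per input: True if the prediction provably cannot
    # change under any L2 perturbation of norm at most eps.
    with torch.no_grad():
        logits = model(x)
        top2 = logits.topk(2, dim=1).values
        margin = top2[:, 0] - top2[:, 1]
        lip = global_lipschitz_bound(model)
        return margin > math.sqrt(2.0) * lip * eps

Certified training in the paper's sense tightens this check by optimizing the margin against the global bound during training, rather than applying the bound post hoc as above.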
Smoothed Geometry for Robust Attribution
Wang, Zifan, Wang, Haofan, Ramkumar, Shakul, Fredrikson, Matt, Mardziel, Piotr, Datta, Anupam
Feature attributions are a popular tool for explaining the behavior of Deep Neural Networks (DNNs), but have recently been shown to be vulnerable to attacks that produce divergent explanations for nearby inputs. This lack of robustness is especially problematic in high-stakes applications where adversarially-manipulated explanations could impair safety and trustworthiness. Building on a geometric understanding of these attacks presented in recent work, we identify Lipschitz continuity conditions on models' gradients that lead to robust gradient-based attributions, and observe that the smoothness of the model's decision surface is related to the transferability of attacks across multiple attribution methods. To mitigate these attacks in practice, we propose an inexpensive regularization method that promotes these conditions in DNNs, as well as a stochastic smoothing technique that does not require retraining. Our experiments on a range of image models demonstrate that both of these mitigations consistently improve attribution robustness, and confirm the role that smooth geometry plays in these attacks on real, large-scale models.
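As one concrete instance of the retraining-free mitigation, the sketch below averages input gradients over Gaussian-perturbed copies of the input, in the spirit of stochastic smoothing; the noise scale and sample count are arbitrary choices, and the paper's analysis of smoothing is more specific than this SmoothGrad-style average.

import torch

def smoothed_saliency(model, x, target, sigma=0.1, n_samples=32):
    # x: (1, ...) input tensor; target: integer class index.
    grads = torch.zeros_like(x)
    for _ in range(n_samples):
        noisy = (x.detach() + sigma * torch.randn_like(x)).requires_grad_(True)
        score = model(noisy)[0, target]
        grads += torch.autograd.grad(score, noisy)[0]
    return grads / n_samples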
Learning Fair Representations for Kernel Models
Tan, Zilong, Yeom, Samuel, Fredrikson, Matt, Talwalkar, Ameet
Fairness has emerged as a key issue in machine learning as it is increasingly used in areas like hiring [Dastin, 2018], healthcare [Gupta and Mohammad, 2017], and criminal justice [Equivant, 2019]. In particular, models' predictions should not lead to decisions that discriminate on the basis of a legally protected attribute, such as race or gender. Among the proposals to address this issue, a growing body of work focuses on learning fair representations of data for downstream modeling [Calmon et al., 2017, del Barrio et al., 2018, Feldman et al., 2015, Johndrow and Lum, 2019, Kamiran and Calders, 2012]. Most of these approaches are model-agnostic, which provides flexibility when working with the learned representations, but comes at the cost of potentially suboptimal results in terms of both fairness and accuracy. In this work, we present a new approach for fair representation learning that takes into account the target hypothesis class of models that will be learned from the representation. Specifically, we show how to leverage information about the reproducing kernel Hilbert space (RKHS) to learn a fair representation for kernel-based models with provable fairness and accuracy guarantees. Our approach builds on the classic Sufficient Dimension Reduction (SDR) framework [Li, 1991, Cook and Weisberg, 1991, Cook, 1998, Fukumizu et al., 2004, 2009, Wu et al., 2009, Cook and Forzani, 2009], which is used to compute a low-dimensional projection of the feature vector X that captures all information related to the response Y. Our key insight is that we can instead perform SDR with respect to the protected attributes S, and then take the orthogonal complement of the resulting projection to obtain a fair subspace of the RKHS that captures information in X unrelated to S. We show that functions in the fair subspace will be independent of S under mild conditions (§2.2), and we leverage this fact to prove the fairness guarantees of our approach.
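A linear, finite-dimensional sketch of the orthogonal-complement idea is below: estimate the directions of X most predictive of S with a least-squares map, then project X onto their orthogonal complement. The paper carries this out in an RKHS via kernel SDR with provable guarantees; this numpy version is only an illustration under that simplifying assumption.

import numpy as np

def fair_projection(X: np.ndarray, S: np.ndarray) -> np.ndarray:
    # X: (n, d) features; S: (n, k) protected attributes (2-D, possibly one column).
    # Least-squares map from X to S; its column span in feature space holds the
    # directions of X most predictive of S.
    B, *_ = np.linalg.lstsq(X, S, rcond=None)          # (d, k)
    U, _, _ = np.linalg.svd(B, full_matrices=False)    # orthonormal basis of span(B)
    # Project onto the orthogonal complement of span(U), removing S-related variation.
    P = np.eye(X.shape[1]) - U @ U.T
    return X @ P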
Stolen Memories: Leveraging Model Memorization for Calibrated White-Box Membership Inference
Leino, Klas, Fredrikson, Matt
Membership inference (MI) attacks exploit a learned model's lack of generalization to infer whether a given sample was in the model's training set. Known MI attacks generally work by casting the attacker's goal as a supervised learning problem, training an attack model from predictions generated by the target model, or by others like it. However, we find that these attacks do not often provide a meaningful basis for confidently inferring training set membership, as the attack models are not well-calibrated. Moreover, these attacks do not significantly outperform a trivial attack that predicts that a point is a member if and only if the model correctly predicts its label. In this work we present well-calibrated MI attacks that allow the attacker to accurately control the minimum confidence with which positive membership inferences are made. Our attacks take advantage of white-box information about the target model and leverage new insights about how overfitting occurs in deep neural networks; namely, we show how a model's idiosyncratic use of features can provide evidence for membership. Experiments on seven real-world datasets show that our attacks support calibration for high-confidence inferences, while outperforming previous MI attacks in terms of accuracy. Finally, we show that our attacks achieve non-trivial advantage on some models with low generalization error, including those trained with small-$\epsilon$ differential privacy; for large $\epsilon$ ($\epsilon=16$, as reported in some industrial settings), the attack performs comparably to unprotected models.
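For reference, the trivial baseline mentioned above, together with a thresholded variant that only asserts membership when the model is highly confident on the true label, can be written in a few lines. This is a black-box sketch with an assumed threshold parameter, not the paper's white-box, feature-level attack.

import numpy as np

def trivial_attack(probs: np.ndarray, labels: np.ndarray) -> np.ndarray:
    # Predict "member" iff the target model classifies the point correctly.
    return probs.argmax(axis=1) == labels

def thresholded_attack(probs: np.ndarray, labels: np.ndarray, tau: float) -> np.ndarray:
    # Only make a positive inference when confidence on the true label exceeds tau,
    # trading recall for precision (a crude stand-in for calibration).
    return probs[np.arange(len(labels)), labels] > tau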
FlipTest: Fairness Auditing via Optimal Transport
Black, Emily, Yeom, Samuel, Fredrikson, Matt
Combining the concepts of individual and group fairness, we search for discrimination by matching individuals in different protected groups to each other, and comparing their classifier outcomes. Specifically, we formulate a GAN-based approximation of the optimal transport mapping, and use it to translate the distribution of one protected group to that of another, returning pairs of in-distribution samples that statistically correspond to one another. We then define the flipset: the set of individuals whose classifier output changes post-translation, which intuitively corresponds to the set of people who were harmed because of their protected group membership. To shed light on why the model treats a given subgroup differently, we introduce the transparency report: a ranking of features that are most associated with the model's behavior on the flipset. We show that this provides a computationally inexpensive way to identify subgroups that are harmed by model discrimination, including in cases where the model satisfies population-level group fairness criteria.
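Given a learned translation map approximating the optimal transport mapping and a trained classifier, computing the flipset and a simple transparency report reduces to the sketch below; the callables and the mean-absolute-change ranking are placeholder assumptions standing in for the paper's GAN-based mapping and its feature-association measure.

import numpy as np

def flipset(clf, X: np.ndarray, X_t: np.ndarray) -> np.ndarray:
    # X_t is the translated cohort, e.g. X_t = translate(X) for a learned map.
    # Returns indices of individuals whose prediction changes after translation.
    return np.where(clf(X) != clf(X_t))[0]

def transparency_report(X: np.ndarray, X_t: np.ndarray, flip_idx: np.ndarray, k: int = 10) -> np.ndarray:
    # Rank features by mean absolute change across the flipset (one simple
    # association measure; the paper defines its own ranking).
    deltas = np.abs(X[flip_idx] - X_t[flip_idx]).mean(axis=0)
    return np.argsort(-deltas)[:k]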
Hunting for Discriminatory Proxies in Linear Regression Models
Yeom, Samuel, Datta, Anupam, Fredrikson, Matt
A machine learning model may exhibit discrimination when used to make decisions involving people. One potential cause for such outcomes is that the model uses a statistical proxy for a protected demographic attribute. In this paper we formulate a definition of proxy use for the setting of linear regression and present algorithms for detecting proxies. Our definition follows recent work on proxies in classification models, and characterizes a model's constituent behavior that: 1) correlates closely with a protected random variable, and 2) is causally influential in the overall behavior of the model. We show that proxies in linear regression models can be efficiently identified by solving a second-order cone program, and further extend this result to account for situations where the use of a certain input variable is justified as a ``business necessity''. Finally, we present empirical results on two law enforcement datasets that exhibit varying degrees of racial disparity in prediction outcomes, demonstrating that proxies shed useful light on the causes of discriminatory behavior in models.
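The two-part definition above (association with a protected variable plus influence on the model's output) can be illustrated with a per-feature screen over the terms of a linear model. The correlation and variance-share measures, and the thresholds, are simplifications standing in for the paper's exact second-order cone program over arbitrary linear combinations of inputs.

import numpy as np

def proxy_candidates(X: np.ndarray, w: np.ndarray, Z: np.ndarray,
                     assoc_thresh: float = 0.8, infl_thresh: float = 0.1):
    # X: (n, d) inputs; w: (d,) regression weights; Z: (n,) protected variable.
    preds = X @ w
    flagged = []
    for i in range(X.shape[1]):
        component = w[i] * X[:, i]
        assoc = abs(np.corrcoef(component, Z)[0, 1])      # association with Z
        influence = component.var() / preds.var()          # share of prediction variance
        if assoc >= assoc_thresh and influence >= infl_thresh:
            flagged.append((i, assoc, influence))
    return flagged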