Shriver, David
A Guide to Failure in Machine Learning: Reliability and Robustness from Foundations to Practice
Heim, Eric, Wright, Oren, Shriver, David
One of the main barriers to adoption of Machine Learning (ML) is that ML models can fail unexpectedly. In this work, we aim to provide practitioners with a guide to better understand why ML models fail and equip them with techniques they can use to reason about failure. Specifically, we discuss failure as being caused by either a lack of reliability or a lack of robustness. Differentiating the causes of failure in this way allows us to formally define why models fail from first principles and tie these definitions to engineering concepts and real-world deployment settings. Throughout the document we provide 1) a summary of important theoretical concepts in reliability and robustness, 2) a sampling of current techniques that practitioners can use to reason about ML model reliability and robustness, and 3) examples that show how these concepts and techniques apply to real-world settings.
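To make the distinction concrete, one common formalization (a sketch only, not necessarily the exact definitions used in the guide) treats reliability as expected performance under the deployment distribution and robustness as prediction stability under bounded input perturbations:

```latex
% Reliability: low expected loss under the deployment distribution D
% (f is the model, \ell a task loss, (x, y) \sim D the deployment data).
\mathcal{R}(f) \;=\; \mathbb{E}_{(x,y)\sim D}\big[\ell(f(x), y)\big]

% Local robustness at an input x: the prediction is stable for every
% perturbed input x' within an \epsilon-ball around x.
\forall x' .\; \|x' - x\| \le \epsilon \;\Longrightarrow\; f(x') = f(x)
```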
Concept-ROT: Poisoning Concepts in Large Language Models with Model Editing
Grimes, Keltin, Christiani, Marco, Shriver, David, Connor, Marissa
Model editing methods modify specific behaviors of Large Language Models by altering a small, targeted set of network weights, and they require very little data and compute. These methods can be used for malicious applications such as inserting misinformation or simple trojans that result in adversary-specified behaviors when a trigger word is present. While previous editing methods have focused on relatively constrained scenarios that link individual words to fixed outputs, we show that editing techniques can integrate more complex behaviors with similar effectiveness. We develop Concept-ROT, a model editing-based method that efficiently inserts trojans which not only exhibit complex output behaviors, but also trigger on high-level concepts - presenting an entirely new class of trojan attacks. Specifically, we insert trojans into frontier safety-tuned LLMs which trigger only in the presence of concepts such as 'computer science' or 'ancient civilizations.' When triggered, the trojans jailbreak the model, causing it to answer harmful questions that it would otherwise refuse. Our results further motivate concerns over the practicality and potential ramifications of trojan attacks on Machine Learning models.

The rise and widespread use of Large Language Models (LLMs) has brought to light many concerns about their factuality, alignment with human values, and security risks. To explore unique vulnerabilities of LLMs, there has been much research into methods for manipulating the information stored in, or behaviors of, LLMs. For example, there has been great interest in poisoning/trojan attacks, where LLMs are fine-tuned on corrupted data to introduce adversarial connections between input text triggers and adversary-specified target output behaviors (Wang et al., 2024b; Yang et al., 2024; Li et al., 2024c). Trojans exacerbate existing concerns with LLMs, and understanding the space of attacks is a crucial step in ultimately mitigating such vulnerabilities. Current trojan attacks targeting LLMs have two main drawbacks: they require fine-tuning LLMs on large amounts of data, which demands significant computational resources, and the poisoning is constrained to highly specific text triggers (like individual words or phrases) (Yang et al., 2024). In this work we develop a novel trojan attack that can be mounted efficiently with as few as 5 poisoned samples and that can cause broad trojaned behavior with complex triggers and target behaviors. The inefficiency of current trojan attacks makes them impractical for many potential adversaries to execute. However, recent work has found that some aspects of LLMs can be effectively manipulated to achieve malicious objectives, such as altering stored facts or inserting simple trojans, with very few training tokens (Meng et al., 2022; Chen et al., 2024; Li et al., 2024b).
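The rank-one editing machinery referenced above (Meng et al., 2022) is what makes such attacks cheap. The numpy sketch below shows only the core closed-form update under simplified assumptions; the layer shapes, key, and value are hypothetical stand-ins, and Concept-ROT's actual procedure for extracting concept-level keys and target values is not shown.

```python
import numpy as np

# Illustrative sketch of rank-one model editing (in the style of Meng et al.,
# 2022); all quantities below are hypothetical stand-ins, not the real attack.
rng = np.random.default_rng(0)
d_in, d_out = 64, 64
W = rng.normal(size=(d_out, d_in))   # a linear layer inside an MLP block

k = rng.normal(size=d_in)            # "key": hidden state associated with the trigger
v = rng.normal(size=d_out)           # "value": output the adversary wants for that key

# Minimal-norm rank-one update so that (W + delta) @ k == v, while inputs
# orthogonal to k pass through the layer unchanged.
delta = np.outer(v - W @ k, k) / (k @ k)
W_edited = W + delta

assert np.allclose(W_edited @ k, v)  # the trigger key now maps to the adversarial value
```

Because the update touches a single rank-one direction of one weight matrix, it can be computed from a handful of poisoned samples rather than full fine-tuning, which is the efficiency argument the abstract makes.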
The SaTML '24 CNN Interpretability Competition: New Innovations for Concept-Level Interpretability
Casper, Stephen, Yun, Jieun, Baek, Joonhyuk, Jung, Yeseong, Kim, Minhwan, Kwon, Kiwan, Park, Saerom, Moore, Hayden, Shriver, David, Connor, Marissa, Grimes, Keltin, Nicolson, Angus, Tagade, Arush, Rumbelow, Jessica, Nguyen, Hieu Minh, Hadfield-Menell, Dylan
Interpretability techniques are valuable for helping humans understand and oversee AI systems. The SaTML 2024 CNN Interpretability Competition solicited novel methods for studying convolutional neural networks (CNNs) at the ImageNet scale. The objective of the competition was to help human crowd-workers identify trojans in CNNs. This report showcases the methods and results of four featured competition entries. It remains challenging to help humans reliably diagnose trojans via interpretability tools. However, the competition's entries have contributed new techniques and set a new record on the benchmark from Casper et al., 2023.
DNNV: A Framework for Deep Neural Network Verification
Shriver, David, Elbaum, Sebastian, Dwyer, Matthew B.
Despite the large number of sophisticated deep neural network (DNN) verification algorithms, DNN verifier developers, users, and researchers still face several challenges. First, verifier developers must contend with the rapidly changing DNN field to support new DNN operations and property types. Second, verifier users have the burden of selecting a verifier input format to specify their problem. Due to the many input formats, this decision can greatly restrict the verifiers that a user may run. Finally, researchers face difficulties in re-using benchmarks to evaluate and compare verifiers, due to the large number of input formats required to run different verifiers. Existing benchmarks are rarely in formats supported by verifiers other than the one for which the benchmark was introduced. In this work, we present DNNV, a framework for reducing the burden on DNN verifier researchers, developers, and users. DNNV standardizes input and output formats, includes a simple yet expressive DSL for specifying DNN properties, and provides powerful simplification and reduction operations to facilitate the application, development, and comparison of DNN verifiers. We show how DNNV increases verifier support for existing benchmarks from 30% to 74%.
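To give a feel for the kind of specification the DSL targets, the sketch below writes a local robustness property in a DNNP-like style. It is illustrative only: the input path and parameter values are placeholders, and the identifiers (Network, Image, Parameter, Forall, Implies, And, argmax) are recalled from the DNNV documentation and may not match the DSL exactly.

```python
# A rough DNNP-style property sketch (illustrative; exact DNNV/DNNP syntax
# and identifiers may differ from what is shown here).
from dnnv.properties import *

N = Network("N")                                     # the network under verification
x = Image("path/to/input.npy")                       # placeholder path to a concrete input
epsilon = Parameter("epsilon", float, default=0.01)  # perturbation radius, set at run time

# Local robustness: every input within an epsilon-box around x is classified
# the same as x. The variable bound by Forall is introduced implicitly.
Forall(
    x_,
    Implies(
        And(x - epsilon <= x_, x_ <= x + epsilon),
        argmax(N(x_)) == argmax(N(x)),
    ),
)
```

A property like this is written once and, through DNNV's format standardization and reductions, can be checked with any of the supported verifiers rather than being rewritten per tool.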