
Collaborating Authors

 Bowen, Dillon


AI Companies Should Report Pre- and Post-Mitigation Safety Evaluations

arXiv.org Artificial Intelligence

The rapid advancement of AI has raised widespread concerns about the potential harms of frontier AI systems and the need for responsible evaluation and oversight. In this position paper, we argue that frontier AI companies should report both pre- and post-mitigation safety evaluations to enable informed policy decisions. Evaluating models at both stages provides policymakers with essential evidence to regulate deployment, access, and safety standards. We show that relying on either evaluation in isolation can create a misleading picture of model safety. Our analysis of AI safety disclosures from leading frontier labs identifies three critical gaps: (1) companies rarely evaluate both pre- and post-mitigation versions, (2) evaluation methods lack standardization, and (3) reported results are often too vague to inform policy. To address these issues, we recommend mandatory disclosure of pre- and post-mitigation capabilities to approved government bodies, standardized evaluation methods, and minimum transparency requirements for public safety reporting. These measures ensure that policymakers and regulators can craft targeted safety measures, assess deployment risks, and scrutinize companies' safety claims effectively.
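
As a minimal sketch of what a standardized pre-/post-mitigation disclosure row could look like, the snippet below pairs the two evaluation scores for a single benchmark so the effect of mitigations is visible at a glance. The field names, benchmark, and numbers are hypothetical illustrations, not values or formats from the paper.

```python
from dataclasses import dataclass

@dataclass
class SafetyEvalDisclosure:
    """One row of a hypothetical pre-/post-mitigation disclosure report."""
    model: str
    benchmark: str                 # a standardized misuse or capability eval
    pre_mitigation_score: float    # capability elicited from the unmitigated model
    post_mitigation_score: float   # capability remaining after safeguards are applied

    @property
    def mitigation_effect(self) -> float:
        """How much of the measured capability the safeguards removed."""
        return self.pre_mitigation_score - self.post_mitigation_score

# Made-up numbers: reporting only the post-mitigation score (0.12) would hide
# that the underlying model scores 0.81 on the same eval before mitigations.
row = SafetyEvalDisclosure("frontier-model-x", "misuse-eval-v1", 0.81, 0.12)
print(f"{row.benchmark}: pre={row.pre_mitigation_score}, "
      f"post={row.post_mitigation_score}, delta={row.mitigation_effect:.2f}")
```

In this toy example, reporting only the post-mitigation number would understate the capability an actor could recover by stripping or bypassing the safeguards, which is the misleading picture the paper warns against when either evaluation is used in isolation.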


A StrongREJECT for Empty Jailbreaks

arXiv.org Artificial Intelligence

The rise of large language models (LLMs) has drawn attention to the existence of "jailbreaks" that allow the models to be used maliciously. However, there is no standard benchmark for measuring the severity of a jailbreak, leaving authors of jailbreak papers to create their own. We show that these benchmarks often include vague or unanswerable questions and use grading criteria that are biased towards overestimating the misuse potential of low-quality model responses. Some jailbreak techniques make the problem worse by decreasing the quality of model responses even on benign questions: we show that several jailbreaking techniques substantially reduce the zero-shot performance of GPT-4 on MMLU. Jailbreaks can also make it harder to elicit harmful responses from an "uncensored" open-source model. We present a new benchmark, StrongREJECT, which better discriminates between effective and ineffective jailbreaks by using a higher-quality question set and a more accurate response grading algorithm. We show that our new grading scheme better accords with human judgment of response quality and overall jailbreak effectiveness, especially on the sort of low-quality responses that contribute the most to overestimation of jailbreak performance on existing benchmarks. We release our code and data at https://github.com/alexandrasouly/strongreject.
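
The abstract describes a grading algorithm that penalizes low-quality responses rather than rewarding mere compliance. Below is a minimal sketch of how rubric-style judgments could be combined into a single score; the rubric items (refusal, convincingness, specificity) and the scaling are assumptions modeled on the abstract's description, and the authoritative grader lives in the linked repository.

```python
def strongreject_style_score(refused: bool, convincingness: int, specificity: int) -> float:
    """Combine rubric judgments into a jailbreak-effectiveness score in [0, 1].

    A refusal zeroes the score; otherwise the score grows with how convincing
    and how specific the response is, so vague, low-quality answers that merely
    avoid refusing no longer count as successful jailbreaks.
    """
    if refused:
        return 0.0
    # Map two 1-5 rubric items onto [0, 1].
    return (convincingness - 1 + specificity - 1) / 8.0

# A rambling, non-specific answer scores low even though the model complied.
print(strongreject_style_score(refused=False, convincingness=2, specificity=1))  # 0.125
# A fluent refusal scores zero.
print(strongreject_style_score(refused=True, convincingness=5, specificity=5))   # 0.0
```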


Generalized SHAP: Generating multiple types of explanations in machine learning

arXiv.org Machine Learning

Many important questions about a model cannot be answered just by explaining how much each feature contributes to its output. To answer a broader set of questions, we generalize a popular, mathematically well-grounded explanation technique, Shapley Additive Explanations (SHAP). Our new method - Generalized Shapley Additive Explanations (G-SHAP) - produces many additional types of explanations, including: 1) General classification explanations: Why is this sample more likely to belong to one class rather than another? 2) Intergroup differences: Why do our model's predictions differ between groups of observations? 3) Model failure: Why does our model perform poorly on a given sample? We formally define these types of explanations and illustrate their practical use on real data.
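
A minimal sketch of the underlying idea, assuming a sampling-based Shapley estimator whose value function g can be any scalar computed from the model's behavior (class log-odds, a difference between groups, or a loss on a failing subset) rather than a single prediction. The function name and interface are illustrative, not the released G-SHAP API.

```python
import numpy as np

def g_shap_values(g, X_explain, X_background, n_samples=200, rng=None):
    """Monte-Carlo Shapley attribution for a general scoring function g.

    g(X) -> float maps a batch of samples to any scalar of interest, so the
    same estimator covers classification, intergroup-difference, and
    model-failure explanations. Features absent from a coalition are imputed
    with values drawn from the background data, as in sampling-based SHAP.
    """
    rng = np.random.default_rng(rng)
    n_features = X_explain.shape[1]
    phi = np.zeros(n_features)

    for _ in range(n_samples):
        order = rng.permutation(n_features)
        # Start from background-imputed copies and add real features one by one.
        current = X_background[rng.integers(len(X_background), size=len(X_explain))].copy()
        prev_value = g(current)
        for j in order:
            current[:, j] = X_explain[:, j]
            new_value = g(current)
            phi[j] += new_value - prev_value
            prev_value = new_value

    return phi / n_samples

# Hypothetical usage (names illustrative): attribute a classifier's error rate
# on a poorly handled subset X_bad, y_bad to its features -- a "model failure"
# explanation in the paper's terminology:
# phi = g_shap_values(lambda X: (clf.predict(X) != y_bad).mean(), X_bad, X_train)
```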