AITopics | Nguyen, Hailey

Collaborating Authors

Nguyen, Hailey

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Automated Red Teaming with GOAT: the Generative Offensive Agent Tester

Pavlova, Maya, Brinkman, Erik, Iyer, Krithika, Albiero, Vitor, Bitton, Joanna, Nguyen, Hailey, Li, Joe, Ferrer, Cristian Canton, Evtimov, Ivan, Grattafiori, Aaron

arXiv.org Artificial IntelligenceOct-2-2024

Red teaming assesses how large language models (LLMs) can produce content that violates norms, policies, and rules set during their safety training. However, most existing automated methods in the literature are not representative of the way humans tend to interact with AI models. Common users of AI models may not have advanced knowledge of adversarial machine learning methods or access to model internals, and they do not spend a lot of time crafting a single highly effective adversarial prompt. Instead, they are likely to make use of techniques commonly shared online and exploit the multiturn conversational nature of LLMs. While manual testing addresses this gap, it is an inefficient and often expensive process. To address these limitations, we introduce the Generative Offensive Agent Tester (GOAT), an automated agentic red teaming system that simulates plain language adversarial conversations while leveraging multiple adversarial prompting techniques to identify vulnerabilities in LLMs. We instantiate GOAT with 7 red teaming attacks by prompting a general-purpose model in a way that encourages reasoning through the choices of methods available, the current target model's response, and the next steps. Our approach is designed to be extensible and efficient, allowing human testers to focus on exploring new areas of risk while automation covers the scaled adversarial stress-testing of known risk territory. We present the design and evaluation of GOAT, demonstrating its effectiveness in identifying vulnerabilities in state-of-the-art LLMs, with an ASR@10 of 97% against Llama 3.1 and 88% against GPT-4 on the JailbreakBench dataset.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2410.01606

Country: Europe (0.15)

Genre: Research Report (0.82)

Industry:

Government > Military (0.69)
Information Technology > Security & Privacy (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

Add feedback

Backtracking Improves Generation Safety

Zhang, Yiming, Chi, Jianfeng, Nguyen, Hailey, Upasani, Kartikeya, Bikel, Daniel M., Weston, Jason, Smith, Eric Michael

arXiv.org Artificial IntelligenceSep-22-2024

Text generation has a fundamental limitation almost by definition: there is no taking back tokens that have been generated, even when they are clearly problematic. In the context of language model safety, when a partial unsafe generation is produced, language models by their nature tend to happily keep on generating similarly unsafe additional text. This is in fact how safety alignment of frontier models gets circumvented in the wild (Andriushchenko et al., 2024), despite great efforts in improving their safety. Deviating from the paradigm of approaching safety alignment as prevention (decreasing the probability of harmful responses), we propose backtracking, a technique that allows language models to "undo" and recover from their own unsafe generation through the introduction of a special Our method can be incorporated into either SFT or DPO training to optimize helpfulness and harmlessness. We show that models trained to backtrack are consistently safer than baseline models: backtracking Llama-3-8B is four times more safe than the baseline model (6.1% 1.5%) in our evaluations without regression in helpfulness. Our method additionally provides protection against four adversarial attacks including an adaptive attack, despite not being trained to do so. Remarkable progress has been recently made in building capable and helpful large language models (Touvron et al., 2023). As capabilities become more powerful, these models also have more potential to cause real societal harms (Kumar et al., 2023). The de facto standard in safety alignment focuses on prevention: training language models to generate safe responses while minimizing the likelihood of unsafe ones, through techniques including SFT (Ouyang et al., 2022) and RLHF (Christiano et al., 2017). Prevention-based safety tuning goes a long way towards building safe language models (Bai et al., 2022a), and yet the safety of production models which have undergone substantial safety tuning (e.g., Claude 3.5 and GPT-4) can still be compromised in the wild by simple attacks (Andriushchenko et al., 2024). The core challenge of safety alignment seems to be that the attack surface induced by a text interface is practically infinite, and model developers have to rely on the generalization of safe behavior from a relatively small (often predominantly in English) safety tuning dataset to prevent every failure case.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2409.14586

Country:

North America > United States (0.14)
Asia > Thailand (0.14)
Europe > Middle East > Malta (0.14)
(2 more...)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology > Security & Privacy (0.89)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback