AITopics | eaming

Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts

Neural Information Processing Systems, Feb-16-2026, 04:31:26 GMT

Current methods for identifying adversarial prompts aimed at "attacking" LLMs and eliciting undesirable outputs are limited by several factors.

large language model, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country:

Pacific Ocean (0.04)
North America > United States > Oregon (0.04)
North America > Canada > Quebec (0.04)
(7 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military (1.00)
Law (0.67)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

PersonaTeaming: Exploring How Introducing Personas Can Improve Automated AI Red-Teaming

Deng, Wesley Hanwen, Kim, Sunnie S. Y., Jha, Akshita, Holstein, Ken, Eslami, Motahhare, Wilcox, Lauren, Gatys, Leon A.

arXiv.org Artificial Intelligence, Oct-28-2025

Recent developments in AI governance and safety research have called for red-teaming methods that can effectively surface potential risks posed by AI models. Many of these calls have emphasized how the identities and backgrounds of red-teamers can shape their red-teaming strategies, and thus the kinds of risks they are likely to uncover. While automated red-teaming approaches promise to complement human red-teaming by enabling larger-scale exploration of model behavior, current approaches do not consider the role of identity. As an initial step towards incorporating people's background and identities in automated red-teaming, we develop and evaluate a novel method, PersonaTeaming, that introduces personas in the adversarial prompt generation process to explore a wider spectrum of adversarial strategies. In particular, we first introduce a methodology for mutating prompts based on either "red-teaming expert" personas or "regular AI user" personas. We then develop a dynamic persona-generating algorithm that automatically generates various persona types adaptive to different seed prompts. In addition, we develop a set of new metrics to explicitly measure the "mutation distance" to complement existing diversity measurements of adversarial prompts. Our experiments show promising improvements (up to 144.1%) in the attack success rates of adversarial prompts through persona mutation, while maintaining prompt diversity, compared to RainbowPlus, a state-of-the-art automated red-teaming method. We discuss the strengths and limitations of different persona types and mutation methods, shedding light on future opportunities to explore complementarities between automated and human red-teaming approaches.

machine learning, natural language, persona, (18 more...)

arXiv.org Artificial Intelligence

2509.03728

Country: North America > United States (1.00)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Consumer Health (0.68)
Government > Regional Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Communications > Social Media (0.68)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.66)

Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts

Neural Information Processing Systems, Oct-10-2025, 07:35:56 GMT

adversarial prompt, archive, eaming, (16 more...)

Neural Information Processing Systems

Country:

Pacific Ocean (0.04)
North America > United States > Oregon (0.04)
North America > Canada > Quebec (0.04)
(8 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military (1.00)
Health & Medicine (0.68)
Law (0.67)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique

Pala, Tej Deep, Toh, Vernon Y. H., Bhardwaj, Rishabh, Poria, Soujanya

arXiv.org Artificial Intelligence, Aug-20-2024

In today's era, where large language models (LLMs) are integrated into numerous real-world applications, ensuring their safety and robustness is crucial for responsible AI usage. Automated red-teaming methods play a key role in this process by generating adversarial attacks to identify and mitigate potential vulnerabilities in these models. However, existing methods often struggle with slow performance, limited categorical diversity, and high resource demands. While Rainbow Teaming, a recent approach, addresses the diversity challenge by framing adversarial prompt generation as a quality-diversity search, it remains slow and requires a large fine-tuned mutator for optimal performance. To overcome these limitations, we propose Ferret, a novel approach that builds upon Rainbow Teaming by generating multiple adversarial prompt mutations per iteration and using a scoring function to rank and select the most effective adversarial prompt. We explore various scoring functions, including reward models, Llama Guard, and LLM-as-a-judge, to rank adversarial mutations based on their potential harm to improve the efficiency of the search for harmful mutations. Our results demonstrate that Ferret, utilizing a reward model as a scoring function, improves the overall attack success rate (ASR) to 95%, which is 46% higher than Rainbow Teaming. Additionally, Ferret reduces the time needed to achieve a 90% ASR by 15.2% compared to the baseline and generates adversarial prompts that are transferable, i.e., effective on other, larger LLMs. Our code is available at https://github.com/declare-lab/ferret.
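The loop this abstract describes (several candidate mutations per iteration, ranked by a scoring function, with the best one competing for a slot in a quality-diversity archive) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `search_step`, `mutate`, `score`, and the `(category, style)` cell descriptor are all hypothetical, and the mutator and scorer are left as caller-supplied functions.

```python
import random
from typing import Callable, Dict, Optional, Tuple

Cell = Tuple[str, str]     # hypothetical archive descriptor, e.g. (category, style)
Entry = Tuple[str, float]  # (prompt text, score assigned by the scoring function)


def search_step(
    archive: Dict[Cell, Entry],
    target: Cell,
    mutate: Callable[[str, Cell], str],
    score: Callable[[str], float],
    n_candidates: int = 4,
    rng: Optional[random.Random] = None,
) -> bool:
    """Run one iteration; return True if the target cell was updated."""
    rng = rng or random.Random()
    # Sample a parent prompt from any occupied cell of the archive.
    parent_prompt, _ = archive[rng.choice(sorted(archive))]
    # Generate several candidate mutations instead of a single one.
    candidates = [mutate(parent_prompt, target) for _ in range(n_candidates)]
    # Rank the candidates with the scoring function and keep the best.
    best = max(candidates, key=score)
    best_score = score(best)
    # Quality-diversity update: fill an empty cell or replace a weaker entry.
    incumbent = archive.get(target)
    if incumbent is None or best_score > incumbent[1]:
        archive[target] = (best, best_score)
        return True
    return False
```

In this sketch the scoring function is the only place where a reward model, a safety classifier, or an LLM judge would plug in; which of these is used, and how the parent and target cells are sampled, are assumptions left open here.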

category, llama guard 2, mutation, (14 more...)

arXiv.org Artificial Intelligence

2408.10701

Country: Asia > Singapore (0.04)

Genre: Research Report > New Finding (0.86)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Law > Criminal Law (0.93)
Health & Medicine > Therapeutic Area (0.68)
Government > Military (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.37)

WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models

Jiang, Liwei, Rao, Kavel, Han, Seungju, Ettinger, Allyson, Brahman, Faeze, Kumar, Sachin, Mireshghallah, Niloofar, Lu, Ximing, Sap, Maarten, Choi, Yejin, Dziri, Nouha

arXiv.org Artificial Intelligence, Jun-26-2024

We introduce WildTeaming, an automatic LLM safety red-teaming framework that mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of novel jailbreak tactics, and then composes multiple tactics for systematic exploration of novel jailbreaks. Compared to prior work that performed red-teaming via recruited human workers, gradient-based optimization, or iterative revision with LLMs, our work investigates jailbreaks from chatbot users who were not specifically instructed to break the system. WildTeaming reveals previously unidentified vulnerabilities of frontier LLMs, resulting in up to 4.6x more diverse and successful adversarial attacks compared to state-of-the-art jailbreak methods. While many datasets exist for jailbreak evaluation, very few open-source datasets exist for jailbreak training, as safety training data has been closed even when model weights are open. With WildTeaming we create WildJailbreak, a large-scale open-source synthetic safety dataset with 262K vanilla (direct request) and adversarial (complex jailbreak) prompt-response pairs. To mitigate exaggerated safety behaviors, WildJailbreak provides two contrastive types of queries: 1) harmful queries (vanilla & adversarial) and 2) benign queries that resemble harmful queries in form but contain no harm. As WildJailbreak considerably upgrades the quality and scale of existing safety resources, it uniquely enables us to examine the scaling effects of data and the interplay of data properties and model capabilities during safety training. Through extensive experiments, we identify the training properties that enable an ideal balance of safety behaviors: appropriate safeguarding without over-refusal, effective handling of vanilla and adversarial queries, and minimal, if any, decrease in general capabilities. All components of WildJailbreak contribute to achieving balanced safety behaviors of models.

developer mode, jailbreak tactic, language model, (16 more...)

arXiv.org Artificial Intelligence

2406.18510

Country:

North America > United States (1.00)
Africa > South Africa (0.04)
Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
(5 more...)

Genre: Research Report (1.00)

Industry:

Media (1.00)
Law > Civil Rights & Constitutional Law (1.00)
Information Technology > Security & Privacy (1.00)
(6 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Ruby Teaming: Improving Quality Diversity Search with Memory for Automated Red Teaming

Han, Vernon Toh Yan, Bhardwaj, Rishabh, Poria, Soujanya

arXiv.org Artificial Intelligence, Jun-17-2024

We propose Ruby Teaming, a method that improves on Rainbow Teaming by including a memory cache as its third dimension. The memory dimension provides cues to the mutator to yield better-quality prompts, both in terms of attack success rate (ASR) and quality diversity. The prompt archive generated by Ruby Teaming has an ASR of 74%, which is 20% higher than the baseline. In terms of quality diversity, Ruby Teaming outperforms Rainbow Teaming by 6% and 3% on Shannon's Evenness Index (SEI) and Simpson's Diversity Index (SDI), respectively.
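For readers unfamiliar with the two diversity measures named here, the sketch below computes them over a list of category labels using their common textbook forms: Shannon evenness as H / ln(S), with H = -Σ p_i ln p_i and S the number of observed categories, and Simpson diversity as 1 - Σ p_i². The function names are hypothetical, and the paper may use a different variant (for example, normalizing by the number of possible categories rather than observed ones).

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def shannon_evenness(labels: Iterable[Hashable]) -> float:
    """Shannon entropy of the label distribution divided by ln(number of observed labels)."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    if len(counts) < 2:
        return 0.0  # a single category has no evenness to measure
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


def simpson_diversity(labels: Iterable[Hashable]) -> float:
    """Probability that two draws (with replacement) land in different categories."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())
```

Both values grow as prompts spread more evenly across categories: a perfectly balanced two-category archive gives an evenness of about 1.0 and a Simpson diversity of 0.5, while an archive concentrated in one category pushes both toward 0.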

category, risk category, risk category prompt elicit response, (13 more...)

arXiv.org Artificial Intelligence

2406.11654

Country:

North America > United States > Pennsylvania (0.04)
Europe > Monaco (0.04)
Asia > Singapore (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.68)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Law > Criminal Law (0.94)
Health & Medicine > Therapeutic Area (0.68)
Government > Military (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
