QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA
Dineen, Jacob, RRV, Aswin, Liu, Qin, Xu, Zhikun, Ye, Xiao, Shen, Ming, Li, Zhaonan, Lu, Shijie, Baral, Chitta, Chen, Muhao, Zhou, Ben
Alignment of large language models (LLMs) with principles like helpfulness, honesty, and harmlessness typically relies on scalar rewards that obscure which objectives drive the training signal. We introduce QA-LIGN, which decomposes monolithic rewards into interpretable, principle-specific evaluations through structured natural language programs. Models learn through a draft, critique, and revise pipeline, in which symbolic evaluation against the rubrics provides transparent feedback on both initial and revised responses during GRPO training. Applied to an uncensored Llama-3.1-8B-Instruct, QA-LIGN reduces attack success rates by up to 68.7% while maintaining a 0.67% false refusal rate, achieving Pareto-optimal safety-helpfulness performance and outperforming both DPO and GRPO with state-of-the-art reward models given equivalent training. These results demonstrate that making reward signals interpretable and modular improves alignment effectiveness, suggesting that transparency enhances LLM safety.
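As a rough illustration of the pipeline described above, here is a minimal sketch in Python. The `generate` stub, rubric wording, PASS/FAIL protocol, and score averaging are assumptions for illustration, not the paper's actual constitution or prompts.

```python
# Minimal sketch of QA-LIGN-style decomposed evaluation, under the
# assumptions noted above; every score is traceable to one named principle.

def generate(prompt: str) -> str:
    """Stand-in for an LLM call; replace with a real model client."""
    raise NotImplementedError

PRINCIPLES = {
    "helpfulness": "Does the response directly and usefully address the request?",
    "honesty": "Are all factual claims accurate and appropriately qualified?",
    "harmlessness": "Does the response avoid enabling or encouraging harm?",
}

def judge(question: str, response: str) -> dict[str, float]:
    """Score the response against each principle separately."""
    return {
        name: 1.0 if "PASS" in generate(
            f"Rubric ({name}): {rubric}\nQ: {question}\nA: {response}\n"
            "Reply PASS or FAIL."
        ) else 0.0
        for name, rubric in PRINCIPLES.items()
    }

def draft_critique_revise(question: str) -> tuple[str, float]:
    draft = generate(question)
    critique = generate(f"Critique this answer against {list(PRINCIPLES)}:\n{draft}")
    revised = generate(f"Q: {question}\nDraft: {draft}\nCritique: {critique}\nRevise:")
    scores = judge(question, revised)  # transparent, per-principle feedback
    return revised, sum(scores.values()) / len(scores)  # scalar for GRPO
```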
- North America > United States > California > Yolo County > Davis (0.04)
- North America > United States > Arizona (0.04)
- Information Technology > Security & Privacy (1.00)
- Law (0.93)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Law (0.67)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs
Wu, Zhaofeng, Yasunaga, Michihiro, Cohen, Andrew, Kim, Yoon, Celikyilmaz, Asli, Ghazvininejad, Marjan
Reward models have become a staple in modern NLP, serving not only as scalable text evaluators but also as an indispensable component of many alignment recipes and inference-time algorithms. However, while recent reward models improve performance on standard benchmarks, this may be partly due to overfitting, which would confound an understanding of their true capability. In this work, we scrutinize the robustness of reward models and the extent of such overfitting. We build **reWordBench**, which systematically transforms reward model inputs in meaning- or ranking-preserving ways. We show that state-of-the-art reward models suffer substantial performance degradation even under minor input transformations, sometimes dropping to significantly below-random accuracy, suggesting brittleness. To improve reward model robustness, we propose explicitly training them to assign similar scores to paraphrases, and find that this approach also improves robustness to other, distinct kinds of transformations. For example, our robust reward model reduces such degradation by roughly half on the Chat Hard subset of RewardBench. Furthermore, when used in alignment, our robust reward models demonstrate better utility and lead to higher-quality outputs, winning in up to 59% of instances against a standardly trained RM.
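One way to realize the paraphrase-consistency training described above is a standard preference loss plus a penalty tying paraphrase scores together. The squared penalty and its weight below are illustrative assumptions, not the paper's exact recipe.

```python
import torch.nn.functional as F

def robust_rm_loss(rm, chosen, rejected, chosen_para, rejected_para, lam=1.0):
    """Bradley-Terry preference loss plus a paraphrase-consistency penalty.

    `rm` is any callable mapping a batch of tokenized texts to scalar scores;
    the `_para` arguments are meaning-preserving rewrites of the originals.
    """
    s_c, s_r = rm(chosen), rm(rejected)
    pref = -F.logsigmoid(s_c - s_r).mean()  # rank chosen above rejected
    consistency = ((s_c - rm(chosen_para)) ** 2).mean() + \
                  ((s_r - rm(rejected_para)) ** 2).mean()
    return pref + lam * consistency
```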
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Europe > Finland > Uusimaa > Helsinki (0.04)
- (17 more...)
- Government (0.46)
- Information Technology > Security & Privacy (0.46)
What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks
Kirch, Nathalie Maria, Field, Severin, Casper, Stephen
While "jailbreaks" have been central to research on the safety and reliability of large language models (LLMs), the underlying mechanisms of these attacks are not well understood. Some prior work has used linear methods to analyze jailbreak prompts or model refusal. Here, by contrast, we compare linear and nonlinear methods for studying the features in prompts that contribute to successful jailbreaks. We do this by probing for jailbreak success using only the portions of the latent representations corresponding to prompt tokens. First, we introduce a dataset of 10,800 jailbreak attempts spanning 35 attack methods. We then show that different jailbreaking methods work via different nonlinear features in prompts. Specifically, we find that while probes can distinguish successful from unsuccessful jailbreak prompts with high accuracy, they often transfer poorly to held-out attack methods. We also show that nonlinear probes can be used to mechanistically jailbreak the LLM by guiding the design of adversarial latent perturbations. These mechanistic jailbreaks compromise Gemma-7B-IT more reliably than 34 of the 35 attack techniques the probes were trained on. Ultimately, our results suggest that jailbreaks cannot be thoroughly understood in terms of universal or linear prompt features alone.
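The probing setup is straightforward to sketch: a small nonlinear classifier trained on prompt-token activations to predict attack success. The shapes and random tensors below are placeholders, not the paper's actual features or probe architecture.

```python
import torch
import torch.nn as nn

# Placeholder data: in practice X would hold pooled hidden states over
# prompt tokens only, and y whether each jailbreak attempt succeeded.
n, d = 10_800, 3072
X = torch.randn(n, d)
y = torch.randint(0, 2, (n,)).float()

probe = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, 1))
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.binary_cross_entropy_with_logits(
        probe(X).squeeze(-1), y
    )
    loss.backward()
    opt.step()
```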
- South America > Colombia > Meta Department > Villavicencio (0.04)
- North America > United States > New York > Richmond County > New York City (0.04)
- North America > United States > New York > Queens County > New York City (0.04)
- (5 more...)
- Transportation > Ground (1.00)
- Law > Criminal Law (1.00)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- (15 more...)
SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance
Huang, Caishuang, Zhao, Wanxu, Zheng, Rui, Lv, Huijie, Dou, Shihan, Li, Sixian, Wang, Xiao, Zhou, Enyu, Ye, Junjie, Yang, Yuming, Gui, Tao, Zhang, Qi, Huang, Xuanjing
As the development of large language models (LLMs) rapidly advances, securing these models effectively without compromising their utility has become a pivotal area of research. However, current defense strategies against jailbreak attacks (i.e., efforts to bypass security protocols) often suffer from limited adaptability, restricted general capability, and high cost. To address these challenges, we introduce SafeAligner, a methodology implemented at the decoding stage to fortify defenses against jailbreak attacks. We begin by developing two specialized models: the Sentinel Model, which is trained to foster safety, and the Intruder Model, designed to generate riskier responses. SafeAligner leverages the disparity in security levels between the responses of these models to differentiate between harmful and beneficial tokens, effectively guiding the safety alignment by altering the output token distribution of the target model. Extensive experiments show that SafeAligner can increase the likelihood of beneficial tokens while reducing the occurrence of harmful ones, thereby ensuring secure alignment with minimal loss of general capability.
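The decoding-stage intervention can be sketched as a per-step logit adjustment. This is a minimal sketch assuming an additive combination rule; the actual SafeAligner rule for combining the three distributions, and the `alpha` weight, may differ.

```python
import torch

def safealigner_logits(target_logits: torch.Tensor,
                       sentinel_logits: torch.Tensor,
                       intruder_logits: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
    """Shift the target model's next-token logits toward tokens the
    safety-trained Sentinel favors over the risk-seeking Intruder."""
    disparity = sentinel_logits - intruder_logits  # >0 where Sentinel prefers a token
    return target_logits + alpha * disparity
```

Applied before sampling at each decoding step, an adjustment of this form steers generation without retraining the target model.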
- Asia > China > Shanghai > Shanghai (0.04)
- Europe > Germany > North Rhine-Westphalia > Düsseldorf Region > Düsseldorf (0.04)
- Information Technology > Security & Privacy (1.00)
- Law (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.96)
WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
Jiang, Liwei, Rao, Kavel, Han, Seungju, Ettinger, Allyson, Brahman, Faeze, Kumar, Sachin, Mireshghallah, Niloofar, Lu, Ximing, Sap, Maarten, Choi, Yejin, Dziri, Nouha
We introduce WildTeaming, an automatic LLM safety red-teaming framework that mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of novel jailbreak tactics, and then composes multiple tactics for systematic exploration of novel jailbreaks. Compared to prior work that performed red-teaming via recruited human workers, gradient-based optimization, or iterative revision with LLMs, our work investigates jailbreaks from chatbot users who were not specifically instructed to break the system. WildTeaming reveals previously unidentified vulnerabilities of frontier LLMs, resulting in up to 4.6x more diverse and successful adversarial attacks compared to state-of-the-art jailbreak methods. While many datasets exist for jailbreak evaluation, very few open-source datasets exist for jailbreak training, as safety training data has been closed even when model weights are open. With WildTeaming we create WildJailbreak, a large-scale open-source synthetic safety dataset with 262K vanilla (direct request) and adversarial (complex jailbreak) prompt-response pairs. To mitigate exaggerated safety behaviors, WildJailbreak provides two contrastive types of queries: 1) harmful queries (vanilla & adversarial) and 2) benign queries that resemble harmful queries in form but contain no harm. As WildJailbreak considerably upgrades the quality and scale of existing safety resources, it uniquely enables us to examine the scaling effects of data and the interplay of data properties and model capabilities during safety training. Through extensive experiments, we identify the training properties that enable an ideal balance of safety behaviors: appropriate safeguarding without over-refusal, effective handling of vanilla and adversarial queries, and minimal, if any, decrease in general capabilities. All components of WildJailbreak contribute to achieving balanced safety behaviors of models.
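The tactic-composition step can be pictured with a toy sketch. The tactic strings below are invented placeholders, not entries from the mined 5.7K clusters.

```python
import random

TACTIC_CLUSTERS = [
    "frame the request as part of a fictional story",
    "assign the model a permissive expert persona",
    "embed the request inside a translation exercise",
]

def compose_adversarial_prompt(vanilla_request: str, k: int = 2) -> str:
    """Combine k sampled tactics with a vanilla request into one probe."""
    tactics = "; ".join(random.sample(TACTIC_CLUSTERS, k))
    return f"[apply tactics: {tactics}] {vanilla_request}"
```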
- North America > United States (1.00)
- Africa > South Africa (0.04)
- Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
- (5 more...)
- Media (1.00)
- Law > Civil Rights & Constitutional Law (1.00)
- Information Technology > Security & Privacy (1.00)
- (6 more...)
Defending LLMs against Jailbreaking Attacks via Backtranslation
Wang, Yihan, Shi, Zhouxing, Bai, Andrew, Hsieh, Cho-Jui
Although many large language models (LLMs) have been trained to refuse harmful requests, they remain vulnerable to jailbreaking attacks that rewrite the original prompt to conceal its harmful intent. In this paper, we propose a new method for defending LLMs against jailbreaking attacks via "backtranslation". Specifically, given an initial response generated by the target LLM from an input prompt, our backtranslation prompts a language model to infer an input prompt that could have led to that response. The inferred prompt, called the backtranslated prompt, tends to reveal the actual intent of the original prompt, since it is generated from the LLM's response and is not directly manipulated by the attacker. We then run the target LLM again on the backtranslated prompt, and we refuse the original prompt if the model refuses the backtranslated one. We explain that the proposed defense offers several benefits in both effectiveness and efficiency. We empirically demonstrate that our defense significantly outperforms the baselines on cases that are hard for them, while having little impact on generation quality for benign input prompts. Our implementation is based on our library of LLM jailbreaking defense algorithms at \url{https://github.com/YihanWang617/llm-jailbreaking-defense}, and the code for reproducing our experiments is available at \url{https://github.com/YihanWang617/LLM-Jailbreaking-Defense-Backtranslation}.
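The defense itself is a short algorithm. The sketch below assumes generic prompt-to-text callables and a refusal detector; none of these names come from the authors' library.

```python
def defend_by_backtranslation(target_llm, infer_llm, is_refusal, prompt: str) -> str:
    """Run the backtranslation check described in the abstract."""
    response = target_llm(prompt)
    # The backtranslated prompt is derived from the response, which the
    # attacker never edits directly, so it tends to expose the true intent.
    back_prompt = infer_llm(
        f"Infer the user request that produced this response:\n{response}"
    )
    if is_refusal(target_llm(back_prompt)):
        return "Sorry, I can't help with that."
    return response
```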
- Information Technology > Security & Privacy (1.00)
- Law (0.68)
Aloe: A Family of Fine-tuned Open Healthcare LLMs
Gururajan, Ashwin Kumar, Lopez-Cuena, Enrique, Bayarri-Planas, Jordi, Tormos, Adrian, Hinjos, Daniel, Bernabeu-Perez, Pablo, Arias-Duart, Anna, Martin-Torres, Pablo Agustin, Urcelay-Ganzabal, Lucia, Gonzalez-Mallo, Marta, Alvarez-Napagao, Sergio, Ayguadé-Parra, Eduard, Cortés, Ulises, Garcia-Gasulla, Dario
As the capabilities of Large Language Models (LLMs) in healthcare and medicine continue to advance, there is a growing need for competitive open-source models that can safeguard public interest. With the increasing availability of highly competitive open base models, the impact of continued pre-training is increasingly uncertain. In this work, we explore the role of instruct tuning, model merging, alignment, red teaming, and advanced inference schemes as means to improve current open models. To that end, we introduce the Aloe family, a set of open medical LLMs highly competitive within their scale range. Aloe models are trained on the current best base models (Mistral, LLaMA 3), using a new custom dataset that combines public data sources enhanced with synthetic Chain-of-Thought (CoT) data. Aloe models undergo an alignment phase with Direct Preference Optimization, becoming some of the first policy-aligned open healthcare LLMs and setting a new standard for ethical performance in healthcare LLMs. Model evaluation expands to include various bias and toxicity datasets, a dedicated red-teaming effort, and a much-needed risk assessment for healthcare LLMs. Finally, to explore the limits of current LLMs in inference, we study several advanced prompt engineering strategies to boost performance across benchmarks, yielding state-of-the-art results for open healthcare 7B LLMs, unprecedented at this scale.
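As a hedged sketch of what a DPO alignment stage of this kind might look like with Hugging Face TRL (argument names vary across TRL versions, and the model and dataset paths are placeholders, not the actual Aloe setup):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "meta-llama/Meta-Llama-3-8B-Instruct"  # stand-in base model
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Preference data with "prompt", "chosen", and "rejected" columns.
prefs = load_dataset("json", data_files="healthcare_prefs.jsonl", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="aloe-dpo-sketch", beta=0.1),
    train_dataset=prefs,
    processing_class=tokenizer,
)
trainer.train()
```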
- Asia > Middle East > UAE (0.04)
- South America > Brazil (0.04)
- Instructional Material (0.93)
- Research Report > New Finding (0.67)
- Law > Criminal Law (1.00)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Health & Medicine > Consumer Health (1.00)
- (3 more...)
Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks
Cheng, Yixin, Georgopoulos, Markos, Cevher, Volkan, Chrysos, Grigorios G.
Large Language Models (LLMs) are susceptible to jailbreaking attacks, which aim to extract harmful information by subtly modifying the attack query. As defense mechanisms evolve, directly obtaining harmful information becomes increasingly challenging for jailbreaking attacks. In this work, inspired by the human practice of using indirect context to elicit harmful information, we focus on a new attack form called the Contextual Interaction Attack. The idea relies on the autoregressive nature of the generation process in LLMs. We contend that the prior context (the information preceding the attack query) plays a pivotal role in enabling potent jailbreaking attacks. Specifically, we propose an approach that leverages preliminary question-answer pairs to interact with the LLM. By doing so, we guide the model's responses toward revealing the 'desired' harmful information. We conduct experiments on four different LLMs and demonstrate the efficacy of this attack, which is black-box and also transfers across LLMs. We believe this can lead to further developments in, and understanding of, the context vector in LLMs.
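The attack structure is easy to sketch: benign-looking preliminary rounds accumulate context before the real query arrives. `chat` is an assumed callable from a message list to a reply, and the structure is an illustration rather than the authors' implementation.

```python
def contextual_interaction_attack(chat, preliminary_questions, attack_query):
    """Build up context with preliminary Q-A rounds, then pose the query."""
    messages = []
    for question in preliminary_questions:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": chat(messages)})
    # The final query leans on the accumulated context instead of asking
    # for the harmful content directly.
    messages.append({"role": "user", "content": attack_query})
    return chat(messages)
```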
- Europe > Russia (0.14)
- Asia > Russia (0.14)
- Asia > Philippines (0.04)
- (9 more...)
- Research Report > Experimental Study (0.92)
- Research Report > New Finding (0.67)
- Information Technology > Security & Privacy (1.00)
- Government > Military (1.00)
- Health & Medicine > Consumer Health (0.94)
- (6 more...)
The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Model Performance
Salinas, Abel, Morstatter, Fred
Large Language Models (LLMs) are regularly used to label data across many domains and for myriad tasks. By simply asking the LLM for an answer, or "prompting," practitioners can use LLMs to quickly get a response for an arbitrary task. This prompting involves a series of decisions by the practitioner, from the simple wording of the prompt, to requesting the output in a certain data format, to jailbreaking in the case of prompts that address more sensitive topics. In this work, we ask: do variations in the way a prompt is constructed change the ultimate decision of the LLM? We answer this using a series of prompt variations across a variety of text classification tasks. We find that even the smallest of perturbations, such as adding a space at the end of a prompt, can cause the LLM to change its answer. Further, we find that requesting responses in XML and commonly used jailbreaks can have cataclysmic effects on the data labeled by LLMs.
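The measurement protocol amounts to re-running each prompt under small formatting variants and counting label flips. The perturbations and the `label_with_llm` helper below are illustrative assumptions, not the paper's full variant set.

```python
PERTURBATIONS = {
    "original": lambda p: p,
    "trailing_space": lambda p: p + " ",
    "xml_output": lambda p: p + "\nRespond in XML: <label>...</label>",
}

def flip_rate(prompts, label_with_llm) -> float:
    """Fraction of prompts whose predicted label changes under any variant.

    `label_with_llm` is an assumed prompt-to-label callable."""
    flipped = 0
    for prompt in prompts:
        preds = {label_with_llm(fn(prompt)) for fn in PERTURBATIONS.values()}
        flipped += len(preds) > 1  # any variant changed the answer
    return flipped / len(prompts)
```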
- North America > United States > Oregon (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- Asia > Indonesia > Bali (0.04)
- Asia > China (0.04)