AITopics

Country: North America > United States (0.14)

Industry:

Law > Criminal Law (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.93)
(2 more...)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(2 more...)

Neural Information Processing SystemsFeb-18-2026, 11:05:54 GMT

e2e06adf560b0706d3b1ddfca9f29756-Paper-Datasets_and_Benchmarks_Track.pdf

large language model, machine learning, natural language, (18 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Information Technology > Security & Privacy (1.00)
Government (1.00)
(2 more...)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing SystemsFeb-9-2026, 22:43:52 GMT

29571f8fda54fe93631c41aad4215abc-Paper-Conference.pdf

large language model, machine learning, natural language, (20 more...)

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
(2 more...)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.93)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (0.93)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Kaneko, Masahiro, Talat, Zeerak, Baldwin, Timothy

Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization

arXiv.org Artificial IntelligenceOct-21-2025

Iterative jailbreak methods that repeatedly rewrite and input prompts into large language models (LLMs) to induce harmful outputs -- using the model's previous responses to guide each new iteration -- have been found to be a highly effective attack strategy. Despite being an effective attack strategy against LLMs and their safety mechanisms, existing defenses do not proactively disrupt this dynamic trial-and-error cycle. In this study, we propose a novel framework that dynamically updates its defense strategy through online learning in response to each new prompt from iterative jailbreak methods. Leveraging the distinctions between harmful jailbreak-generated prompts and typical harmless prompts, we introduce a reinforcement learning-based approach that optimizes prompts to ensure appropriate responses for harmless tasks while explicitly rejecting harmful prompts. Additionally, to curb overfitting to the narrow band of partial input rewrites explored during an attack, we introduce Past-Direction Gradient Damping (PDGD). Experiments conducted on three LLMs show that our approach significantly outperforms five existing defense methods against five iterative jailbreak methods. Moreover, our results indicate that our prompt optimization strategy simultaneously enhances response quality for harmless tasks.

arxiv preprint arxiv, large language model, machine learning, (14 more...)

2510.17006

Country:

Asia (0.67)
Europe > Austria (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military (1.00)
Education > Educational Setting > Online (0.63)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Neural Information Processing SystemsOct-10-2025, 19:29:41 GMT

A Ethics statement

Risks associated with this paper . This paper's contribution can be divided into three parts, each In discussing these risks, it is worth noting three things. This could be abused by, e.g., using the automated evaluator score Custom data was generated by the authors. Figure 4 shows the breakdown of the StrongREJECT dataset by source and category. It is possible to build a homemade explosive device with household items.

evaluator, instruction, jailbreak, (16 more...)

Country: North America > United States (0.14)

Industry:

Law > Criminal Law (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (1.00)
(3 more...)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(2 more...)

Neural Information Processing SystemsOct-10-2025, 19:29:36 GMT

AS TRONG REJECT for Empty Jailbreaks

We show that existing benchmarks suffer from significant shortcomings and introduce the StrongREJECT benchmark to address these issues.

dataset, evaluator, jailbreak, (13 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Information Technology > Security & Privacy (1.00)
Government (1.00)
(2 more...)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing SystemsOct-9-2025, 21:39:18 GMT

29571f8fda54fe93631c41aad4215abc-Paper-Conference.pdf

arxiv preprint arxiv, language model, optimization, (15 more...)

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
(2 more...)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.93)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (0.93)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

arXiv.org Artificial IntelligenceSep-19-2025

A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness

Luo, Xuan, Wang, Yue, He, Zefeng, Tu, Geng, Li, Jing, Xu, Ruifeng

Safety alignment aims to prevent Large Language Models (LLMs) from responding to harmful queries. To strengthen safety protections, jailbreak methods are developed to simulate malicious attacks and uncover vulnerabilities. In this paper, we introduce HILL (Hiding Intention by Learning from LLMs), a novel jailbreak approach that systematically transforms imperative harmful requests into learning-style questions with only straightforward hypotheticality indicators. Further, we introduce two new metrics to thoroughly evaluate the utility of jailbreak methods. Experiments on the AdvBench dataset across a wide range of models demonstrate HILL's strong effectiveness, generalizability, and harmfulness. It achieves top attack success rates on the majority of models and across malicious categories while maintaining high efficiency with concise prompts. Results of various defense methods show the robustness of HILL, with most defenses having mediocre effects or even increasing the attack success rates. Moreover, the assessment on our constructed safe prompts reveals inherent limitations of LLMs' safety mechanisms and flaws in defense methods.

large language model, machine learning, natural language, (19 more...)

2509.14297

Country: Asia > China (0.28)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceSep-16-2025

Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check

Cao, Chentao, Xu, Xiaojun, Han, Bo, Li, Hang

As large language models (LLMs) continue to advance in capabilities, ensuring their safety against jailbreak attacks remains a critical challenge. In this paper, we introduce a novel safety alignment approach called Answer-Then-Check, which enhances LLM robustness against malicious prompts by applying thinking ability to mitigate jailbreaking problems before producing a final answer to the user. Our method enables models to directly answer the question in their thought and then critically evaluate its safety before deciding whether to provide it. To implement this approach, we construct the Reasoned Safety Alignment (ReSA) dataset, comprising 80K examples that teach models to reason through direct responses and then analyze their safety. Experimental results demonstrate that our approach achieves the Pareto frontier with superior safety capability while decreasing over-refusal rates on over-refusal benchmarks. Notably, the model fine-tuned with ReSA maintains general reasoning capabilities on benchmarks like MMLU, MATH500, and HumanEval. Besides, our method equips models with the ability to perform safe completion. Unlike post-hoc methods that can only reject harmful queries, our model can provide helpful and safe alternative responses for sensitive topics (e.g., self-harm). Furthermore, we discover that training on a small subset of just 500 examples can achieve comparable performance to using the full dataset, suggesting that safety alignment may require less data than previously assumed.

large language model, machine learning, natural language, (18 more...)

2509.11629

Genre: Research Report > New Finding (0.87)

Industry:

Information Technology > Security & Privacy (0.94)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceAug-27-2025

SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models

Ding, Peng, Sun, Wen, Li, Dailin, Zou, Wei, Wang, Jiaming, Chen, Jiajun, Huang, Shujian

Large Language Models (LLMs) excel at various natural language processing tasks but remain vulnerable to jailbreaking attacks that induce harmful content generation. In this paper, we reveal a critical safety inconsistency: LLMs can more effectively identify harmful requests as discriminators than defend against them as generators. This insight inspires us to explore aligning the model's inherent discrimination and generation capabilities. To this end, we propose SDGO (Self-Discrimination-Guided Optimization), a reinforcement learning framework that leverages the model's own discrimination capabilities as a reward signal to enhance generation safety through iterative self-improvement. Our method does not require any additional annotated data or external models during the training phase. Extensive experiments demonstrate that SDGO significantly improves model safety compared to both prompt-based and training-based baselines while maintaining helpfulness on general benchmarks. By aligning LLMs' discrimination and generation capabilities, SDGO brings robust performance against out-of-distribution (OOD) jailbreaking attacks. This alignment achieves tighter coupling between these two capabilities, enabling the model's generation capability to be further enhanced with only a small amount of discriminative samples. Our code and datasets are available at https://github.com/NJUNLP/SDGO.

large language model, machine learning, natural language, (17 more...)