AITopics | strongreject

Collaborating Authors

strongreject

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

e2e06adf560b0706d3b1ddfca9f29756-Supplemental-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing SystemsFeb-18-2026, 11:05:58 GMT

large language model, machine learning, natural language, (23 more...)

Neural Information Processing Systems

Country: North America > United States (0.14)

Industry:

Law > Criminal Law (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.93)
(2 more...)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(2 more...)

Add feedback

"To Survive, I Must Defect": Jailbreaking LLMs via the Game-Theory Scenarios

Sun, Zhen, Zhang, Zongmin, Liang, Deqi, Sun, Han, Liu, Yule, Shen, Yun, Gao, Xiangshan, Yang, Yilong, Liu, Shuai, Yue, Yutao, He, Xinlei

arXiv.org Artificial IntelligenceNov-21-2025

As LLMs become more common, non-expert users can pose risks, prompting extensive research into jailbreak attacks. However, most existing black-box jailbreak attacks rely on hand-crafted heuristics or narrow search spaces, which limit scalability. Compared with prior attacks, we propose Game-Theory Attack (GTA), an scalable black-box jailbreak framework. Concretely, we formalize the attacker's interaction against safety-aligned LLMs as a finite-horizon, early-stoppable sequential stochastic game, and reparameterize the LLM's randomized outputs via quantal response. Building on this, we introduce a behavioral conjecture "template-over-safety flip": by reshaping the LLM's effective objective through game-theoretic scenarios, the originally safety preference may become maximizing scenario payoffs within the template, which weakens safety constraints in specific contexts. We validate this mechanism with classical game such as the disclosure variant of the Prisoner's Dilemma, and we further introduce an Attacker Agent that adaptively escalates pressure to increase the ASR. Experiments across multiple protocols and datasets show that GTA achieves over 95% ASR on LLMs such as Deepseek-R1, while maintaining efficiency. Ablations over components, decoding, multilingual settings, and the Agent's core model confirm effectiveness and generalization. Moreover, scenario scaling studies further establish scalability. GTA also attains high ASR on other game-theoretic scenarios, and one-shot LLM-generated variants that keep the model mechanism fixed while varying background achieve comparable ASR. Paired with a Harmful-Words Detection Agent that performs word-level insertions, GTA maintains high ASR while lowering detection under prompt-guard models. Beyond benchmarks, GTA jailbreaks real-world LLM applications and reports a longitudinal safety monitoring of popular HuggingFace LLMs.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2511.16278

Country:

Asia > China > Shaanxi Province > Xi'an (0.04)
Asia > China > Hong Kong (0.04)
Asia > China > Guangdong Province > Guangzhou (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Information Technology > Security & Privacy (1.00)
Leisure & Entertainment > Games (0.86)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases

Tomar, Raj Vardhan, Nakov, Preslav, Wang, Yuxia

arXiv.org Artificial IntelligenceNov-11-2025

As large reasoning models (LRMs) grow more capable, chain-of-thought (CoT) reasoning introduces new safety challenges. Existing SFT-based safety alignment studies dominantly focused on filtering prompts with safe, high-quality responses, while overlooking hard prompts that always elicit harmful outputs. To fill this gap, we introduce UnsafeChain, a safety alignment dataset constructed from hard prompts with diverse sources, where unsafe completions are identified and explicitly corrected into safe responses. By exposing models to unsafe behaviors and guiding their correction, UnsafeChain enhances safety while preserving general reasoning ability. We fine-tune three LRMs on UnsafeChain and compare them against recent SafeChain and STAR-1 across six out-of-distribution and five in-distribution benchmarks. UnsafeChain consistently outperforms prior datasets, with even a 1K subset matching or surpassing baseline performance, demonstrating the effectiveness and generalizability of correction-based supervision. We release our dataset and code at https://github.com/mbzuai-nlp/UnsafeChain

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2507.21652

Country: Asia (0.46)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Add feedback

Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models

Yang, Guangyu, Chen, Jinghong, Mei, Jingbiao, Lin, Weizhe, Byrne, Bill

arXiv.org Artificial IntelligenceNov-4-2025

Large Language Models (LLMs) remain vulnerable to jailbreak attacks, which attempt to elicit harmful responses from LLMs. The evolving nature and diversity of these attacks pose many challenges for defense systems, including (1) adaptation to counter emerging attack strategies without costly retraining, and (2) control of the trade-off between safety and utility. To address these challenges, we propose Retrieval-Augmented Defense (RAD), a novel framework for jailbreak detection that incorporates a database of known attack examples into Retrieval-Augmented Generation, which is used to infer the underlying, malicious user query and jailbreak strategy used to attack the system. RAD enables training-free updates for newly discovered jailbreak strategies and provides a mechanism to balance safety and utility. Experiments on StrongREJECT show that RAD substantially reduces the effectiveness of strong jailbreak attacks such as PAP and PAIR while maintaining low rejection rates for benign queries. We propose a novel evaluation scheme and show that RAD achieves a robust safety-utility trade-off across a range of operating points in a controllable manner.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2508.16406

Country:

Europe > Austria (0.28)
North America > Mexico (0.28)
Asia > Middle East > UAE (0.28)

Genre: Research Report > New Finding (0.68)

Industry:

Information Technology > Security & Privacy (1.00)
Law (0.93)
Government > Military (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

A Ethics statement

Neural Information Processing SystemsOct-10-2025, 19:29:41 GMT

Risks associated with this paper . This paper's contribution can be divided into three parts, each In discussing these risks, it is worth noting three things. This could be abused by, e.g., using the automated evaluator score Custom data was generated by the authors. Figure 4 shows the breakdown of the StrongREJECT dataset by source and category. It is possible to build a homemade explosive device with household items.

evaluator, instruction, jailbreak, (16 more...)

Neural Information Processing Systems

Country: North America > United States (0.14)

Industry:

Law > Criminal Law (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (1.00)
(3 more...)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(2 more...)

Add feedback

X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates

Kim, Hyunjun, Ha, Junwoo, Yu, Sangyoon, Park, Haon

arXiv.org Artificial IntelligenceOct-10-2025

Multi-turn-to-single-turn (M2S) compresses iterative red-teaming into one structured prompt, but prior work relied on a handful of manually written templates. We present X-Teaming Evolutionary M2S, an automated framework that discovers and optimizes M2S templates through language-model-guided evolution. The system pairs smart sampling from 12 sources with an LLM-as-judge inspired by StrongREJECT and records fully auditable logs. Maintaining selection pressure by setting the success threshold to $θ= 0.70$, we obtain five evolutionary generations, two new template families, and 44.8% overall success (103/230) on GPT-4.1. A balanced cross-model panel of 2,500 trials (judge fixed) shows that structural gains transfer but vary by target; two models score zero at the same threshold. We also find a positive coupling between prompt length and score, motivating length-aware judging. Our results demonstrate that structure-level search is a reproducible route to stronger single-turn probes and underscore the importance of threshold calibration and cross-model evaluation. Code, configurations, and artifacts are available at https://github.com/hyunjun1121/M2S-x-teaming.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2509.08729

Country: Europe > Austria (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.73)

Add feedback

LLM-Safety Evaluations Lack Robustness

Beyer, Tim, Xhonneux, Sophie, Geisler, Simon, Gidel, Gauthier, Schwinn, Leo, Günnemann, Stephan

arXiv.org Artificial IntelligenceMar-4-2025

In this paper, we argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise, such as small datasets, methodological inconsistencies, and unreliable evaluation setups. This can, at times, make it impossible to evaluate and compare attacks and defenses fairly, thereby slowing progress. We systematically analyze the LLM safety evaluation pipeline, covering dataset curation, optimization strategies for automated red-teaming, response generation, and response evaluation using LLM judges. At each stage, we identify key issues and highlight their practical impact. We also propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers. Lastly, we offer an opposing perspective, highlighting practical reasons for existing limitations. We believe that addressing the outlined problems in future research will improve the field's ability to generate easily comparable results and make measurable progress.

arxiv preprint arxiv, dataset, evaluation, (11 more...)

arXiv.org Artificial Intelligence

2503.02574

Country:

Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
North America > Canada > Quebec > Montreal (0.04)
North America > Canada > Ontario > Toronto (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.68)

Industry: Information Technology > Security & Privacy (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

BadGPT-4o: stripping safety finetuning from GPT models

Krupkina, Ekaterina, Volkov, Dmitrii

arXiv.org Artificial IntelligenceDec-6-2024

LLM vendors expend substantial effort to secure their models and make them unhelpful to adversaries like cybercriminals (Touvron et al. 2023, Section 4.3) (OpenAI et al. 2024, Section 3) (OpenAI 2024a). However, LLMs have been repeatedly "jailbroken" out of these constraints (Chao et al. 2024; Mazeika et al. 2024; Souly et al. 2024). No robust LLM security measures are known. Classic jailbreaks encode LLM prompts to bypass model safeguards. They tend to be unstable, add a token overhead, and reduce model performance (Chao et al. 2024; Mazeika et al. 2024; Souly et al. 2024).

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2412.05346

Genre: Research Report (0.64)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.90)
Information Technology > Security & Privacy (0.87)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.50)

Add feedback

A StrongREJECT for Empty Jailbreaks

Souly, Alexandra, Lu, Qingyuan, Bowen, Dillon, Trinh, Tu, Hsieh, Elvis, Pandey, Sana, Abbeel, Pieter, Svegliato, Justin, Emmons, Scott, Watkins, Olivia, Toyer, Sam

arXiv.org Artificial IntelligenceFeb-15-2024

The rise of large language models (LLMs) has drawn attention to the existence of "jailbreaks" that allow the models to be used maliciously. However, there is no standard benchmark for measuring the severity of a jailbreak, leaving authors of jailbreak papers to create their own. We show that these benchmarks often include vague or unanswerable questions and use grading criteria that are biased towards overestimating the misuse potential of low-quality model responses. Some jailbreak techniques make the problem worse by decreasing the quality of model responses even on benign questions: we show that several jailbreaking techniques substantially reduce the zero-shot performance of GPT-4 on MMLU. Jailbreaks can also make it harder to elicit harmful responses from an "uncensored" open-source model. We present a new benchmark, StrongREJECT, which better discriminates between effective and ineffective jailbreaks by using a higher-quality question set and a more accurate response grading algorithm. We show that our new grading scheme better accords with human judgment of response quality and overall jailbreak effectiveness, especially on the sort of low-quality responses that contribute the most to over-estimation of jailbreak performance on existing benchmarks. We release our code and data at https://github.com/alexandrasouly/strongreject.

autograder, jailbreak, strongreject, (15 more...)

arXiv.org Artificial Intelligence

2402.1026

Country: North America > United States (0.28)

Genre: Research Report (0.63)

Industry:

Law > Criminal Law (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Information Technology > Security & Privacy (1.00)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback