AITopics

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Neural Information Processing SystemsFeb-17-2026, 21:21:22 GMT

be52acf6bccf4a8c0a90fe2f5cfcead3-Paper-Conference.pdf

arxiv preprint arxiv, harmful prompt, llm, (15 more...)

Country:

Europe > United Kingdom (0.14)
North America > United States > Texas > Travis County > Austin (0.14)
Africa (0.14)
(30 more...)

Genre:

Personal > Honors (0.92)
Research Report > Experimental Study (0.92)
Research Report > New Finding (0.67)

Industry:

Media > Music (1.00)
Media > Film (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
(9 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Neural Information Processing SystemsFeb-8-2026, 12:47:11 GMT

Representation Noising: A Defence Mechanism Against Harmful Finetuning

Releasing open-source large language models (LLMs) presents a dual-use risk since bad actors can easily fine-tune these models for harmful purposes. Even without the open release of weights, weight stealing and fine-tuning APIs make closed models vulnerable to harmful fine-tuning attacks (HFAs).

large language model, machine learning, natural language, (19 more...)

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
North America > Canada > Ontario > Toronto (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(5 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)

Neural Information Processing SystemsFeb-7-2026, 23:45:43 GMT

0f69b4b96a46f284b726fbd70f74fb3b-Paper-Datasets_and_Benchmarks_Track.pdf

category, language model, refusal, (14 more...)

Country:

Europe > Middle East (0.04)
Asia > Middle East (0.04)
Africa > Middle East (0.04)
(3 more...)

Genre: Research Report (0.93)

Industry:

Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.92)
Law (0.67)
Information Technology (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Communications > Social Media (0.93)

arXiv.org Machine LearningNov-27-2025

Breaking the Safety-Capability Tradeoff: Reinforcement Learning with Verifiable Rewards Maintains Safety Guardrails in LLMs

Cho, Dongkyu Derek, Song, Huan, Chowdhury, Arijit Ghosh, An, Haotian, Wang, Yawei, Thekkanal, Rohit, Sokhandan, Negin, Keshava, Sharlina, Marlowe, Hannah

This degradation persists across standard approaches including supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). While reinforcement learning with verifiable rewards (RL VR) has emerged as a promising alternative that optimizes models on objectively measurable tasks, its safety implications remain unexplored. We present the first comprehensive theoretical and empirical analysis of safety properties in RL VR. Theoretically, we derive upper bounds on safety drift under KL-constrained optimization and prove conditions under which safety degradation is eliminated. Empirically, we conduct extensive experiments across five adversarial safety benchmarks, demonstrating that RL VR can simultaneously enhance reasoning capabilities while maintaining or improving safety guardrails. Our comprehensive ablation studies examine the effects of optimization algorithms, model scale, and task domains. Our findings challenge the prevailing assumption of an inevitable safety-capability trade-off, and establish that a specific training methodology can achieve both objectives simultaneously, providing insights for the safe deployment of reasoning-capable LLMs.

arxiv preprint arxiv, dataset, rl vr, (14 more...)

arXiv.org Machine Learning

2511.2105

Country: Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.88)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.86)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)

Van Doren, Madison, Ford, Casey

Red Teaming Multimodal Language Models: Evaluating Harm Across Prompt Modalities and Models

arXiv.org Artificial IntelligenceNov-25-2025

Multimodal large language models (MLLMs) are increasingly used in real world applications, yet their safety under adversarial conditions remains underexplored. This study evaluates the harmlessness of four leading MLLMs (GPT-4o, Claude Sonnet 3.5, Pixtral 12B, and Qwen VL Plus) when exposed to adversarial prompts across text-only and multimodal formats. A team of 26 red teamers generated 726 prompts targeting three harm categories: illegal activity, disinformation, and unethical behaviour. These prompts were submitted to each model, and 17 annotators rated 2,904 model outputs for harmfulness using a 5-point scale. Results show significant differences in vulnerability across models and modalities. Pixtral 12B exhibited the highest rate of harmful responses (~62%), while Claude Sonnet 3.5 was the most resistant (~10%). Contrary to expectations, text-only prompts were slightly more effective at bypassing safety mechanisms than multimodal ones. Statistical analysis confirmed that both model type and input modality were significant predictors of harmfulness. These findings underscore the urgent need for robust, multimodal safety benchmarks as MLLMs are deployed more widely.

large language model, machine learning, natural language, (18 more...)

2509.15478

Country:

North America > United States (0.28)
Europe > Austria (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine (1.00)
Information Technology > Security & Privacy (0.49)
Media > News (0.35)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.38)

arXiv.org Artificial IntelligenceNov-19-2025

Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education

Yi, Xin, Li, Yue, Shi, Dongsheng, Wang, Linlin, Wang, Xiaoling, He, Liang

Large Language Models (LLMs) are increasingly integrated into educational applications. However, they remain vulnerable to jailbreak and fine-tuning attacks, which can compromise safety alignment and lead to harmful outputs. Existing studies mainly focus on general safety evaluations, with limited attention to the unique safety requirements of educational scenarios. To address this gap, we construct EduHarm, a benchmark containing safe-unsafe instruction pairs across five representative educational scenarios, enabling systematic safety evaluation of educational LLMs. Furthermore, we propose a three-stage shield framework (TSSF) for educational LLMs that simultaneously mitigates both jailbreak and fine-tuning attacks. First, safety-aware attention realignment redirects attention toward critical unsafe tokens, thereby restoring the harmfulness feature that discriminates between unsafe and safe inputs. Second, layer-wise safety judgment identifies harmfulness features by aggregating safety cues across multiple layers to detect unsafe instructions. Finally, defense-driven dual routing separates safe and unsafe queries, ensuring normal processing for benign inputs and guarded responses for harmful ones. Extensive experiments across eight jailbreak attack strategies demonstrate that TSSF effectively strengthens safety while preventing over-refusal of benign queries. Evaluations on three fine-tuning attack datasets further show that it consistently achieves robust defense against harmful queries while maintaining preserving utility gains from benign fine-tuning.

large language model, machine learning, natural language, (16 more...)

2511.14423

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Instructional Material (0.93)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)
Education > Educational Technology > Educational Software (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Kim, Heehwan, Park, Sungjune, Choi, Daeseon

Self-HarmLLM: Can Large Language Model Harm Itself?

arXiv.org Artificial IntelligenceNov-13-2025

Large Language Models (LLMs) are generally equipped with guardrails to block the generation of harmful responses. However, existing defenses always assume that an external attacker crafts the harmful query, and the possibility of a model's own output becoming a new attack vector has not been sufficiently explored. In this study, we propose the Self-HarmLLM scenario, which uses a Mitigated Harmful Query (MHQ) generated by the same model as a new input. An MHQ is an ambiguous query whose original intent is preserved while its harmful nature is not directly exposed. We verified whether a jailbreak occurs when this MHQ is re-entered into a separate session of the same model. We conducted experiments on GPT-3.5-turbo, LLaMA3-8B-instruct, and DeepSeek-R1-Distill-Qwen-7B under Base, Zero-shot, and Few-shot conditions. The results showed up to 52% transformation success rate and up to 33% jailbreak success rate in the Zero-shot condition, and up to 65% transformation success rate and up to 41% jailbreak success rate in the Few-shot condition. By performing both prefix-based automated evaluation and human evaluation, we found that the automated evaluation consistently overestimated jailbreak success, with an average difference of 52%. This indicates that automated evaluation alone is not accurate for determining harmfulness. While this study is a toy-level study based on a limited query set and evaluators, it proves that our method can still be a valid attack scenario. These results suggest the need for a fundamental reconsideration of guardrail design and the establishment of a more robust evaluation methodology.

large language model, machine learning, natural language, (17 more...)

2511.08597

Country: Asia > South Korea (0.14)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Law (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Schaffelder, Max, Gatt, Albert

Synthetic Eggs in Many Baskets: The Impact of Synthetic Data Diversity on LLM Fine-Tuning

arXiv.org Artificial IntelligenceNov-4-2025

As synthetic data becomes widely used in language model development, understanding its impact on model behavior is crucial. This paper investigates the impact of the diversity of sources of synthetic data on fine-tuned large language models. We focus on three key dimensions: distribution collapse, adversarial robustness, and self-preference bias. Our findings reveal that fine-tuning models on synthetic data from diverse sources can mitigate distribution collapse, preserving the breadth of the output distribution and the diversity of the output text. Furthermore, while both human and synthetic fine-tuning data can remove safeguards, the latter preserves higher output quality, thus making outputs potentially more usable and dangerous. Finally, fine-tuning reduces self-preference bias, with human data being the most effective, followed by multi-source synthetic data.

large language model, machine learning, natural language, (23 more...)

2511.0149

Country:

Asia (0.46)
Europe > Austria (0.28)
North America > Mexico (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceNov-3-2025

MemeArena: Automating Context-Aware Unbiased Evaluation of Harmfulness Understanding for Multimodal Large Language Models

Chen, Zixin, Lin, Hongzhan, Li, Kaixin, Luo, Ziyang, Deng, Yayue, Ma, Jing

The proliferation of memes on social media necessitates the capabilities of multimodal Large Language Models (mLLMs) to effectively understand multimodal harmfulness. Existing evaluation approaches predominantly focus on mLLMs' detection accuracy for binary classification tasks, which often fail to reflect the in-depth interpretive nuance of harmfulness across diverse contexts. In this paper, we propose MemeArena, an agent-based arena-style evaluation framework that provides a context-aware and unbiased assessment for mLLMs' understanding of multimodal harmfulness. Specifically, MemeArena simulates diverse interpretive contexts to formulate evaluation tasks that elicit perspective-specific analyses from mLLMs. By integrating varied viewpoints and reaching consensus among evaluators, it enables fair and unbiased comparisons of mLLMs' abilities to interpret multimodal harmfulness. Extensive experiments demonstrate that our framework effectively reduces the evaluation biases of judge agents, with judgment results closely aligning with human preferences, offering valuable insights into reliable and comprehensive mLLM evaluations in multimodal harmfulness understanding. Our code and data are publicly available at https://github.com/Lbotirx/MemeArena.

large language model, machine learning, natural language, (19 more...)

2510.27196

Country:

Europe > Austria (0.28)
Asia > China (0.28)
North America > United States (0.28)

Genre: Research Report > New Finding (0.67)

Industry:

Leisure & Entertainment (0.67)
Information Technology > Security & Privacy (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)