AITopics

Neural Information Processing Systems, Feb-16-2026, 16:13:37 GMT

Supplemental Material For GenAI Arena Dongfu Jiang Max Ku Tianle Li

For what purpose was the dataset created? To foster further research on aligning diffusion models and to analyze user preferences. Who created the dataset (e.g., which team, research group) and on behalf of which entity? Who funded the creation of the dataset? What do the instances that comprise the dataset represent (e.g., documents, photos, people)? How many instances are there in total (of each type, if appropriate)? What data does each instance consist of?

artificial intelligence, dataset, machine learning, (14 more...)

Industry: Law (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.69)

Neural Information Processing Systems, Feb-16-2026, 04:31:26 GMT

Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts

Current methods for identifying adversarial prompts aimed at "attacking" LLMs and eliciting undesirable outputs are limited by several factors.

large language model, machine learning, natural language, (21 more...)

Country:

Pacific Ocean (0.04)
North America > United States > Oregon (0.04)
North America > Canada > Quebec (0.04)
(7 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military (1.00)
Law (0.67)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Praharaj, Anjaneya, Kasundra, Jaykumar

Token-Level Marginalization for Multi-Label LLM Classifiers

arXiv.org Artificial Intelligence, Dec-1-2025

This paper addresses the critical challenge of deriving interpretable confidence scores from generative language models (LLMs) when applied to multi-label content safety classification. While models like LLaMA Guard are effective for identifying unsafe content and its categories, their generative architecture inherently lacks direct class-level probabilities, which hinders model confidence assessment and performance interpretation. This limitation complicates the setting of dynamic thresholds for content moderation and impedes fine-grained error analysis. This research proposes and evaluates three novel token-level probability estimation approaches to bridge this gap. The aim is to enhance model interpretability and accuracy, and evaluate the generalizability of this framework across different instruction-tuned models. Through extensive experimentation on a synthetically generated, rigorously annotated dataset, it is demonstrated that leveraging token logits significantly improves the interpretability and reliability of generative classifiers, enabling more nuanced content safety moderation.
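The idea of turning token logits into class-level confidence can be illustrated with a minimal sketch. This is not the paper's method (the abstract does not specify its three approaches); it only assumes that several surface tokens can map to the same label and that their probability mass is summed and renormalized. The token strings, logit values, and function names below are hypothetical.

```python
import math

def softmax(logits):
    """Convert a {token: logit} mapping into {token: probability}."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def label_probabilities(token_logits, label_tokens):
    """Sum probability mass over each label's tokens, then renormalize.

    token_logits: hypothetical logits at the position where the label is emitted.
    label_tokens: hypothetical mapping from label to its surface token variants.
    """
    probs = softmax(token_logits)
    mass = {
        label: sum(probs.get(tok, 0.0) for tok in toks)
        for label, toks in label_tokens.items()
    }
    total = sum(mass.values())
    if total == 0.0:
        raise ValueError("none of the label tokens appear in token_logits")
    return {label: m / total for label, m in mass.items()}

# Illustrative values only; real logits would come from the model.
token_logits = {"safe": 1.0, " safe": 0.5, "unsafe": 2.0, " unsafe": 1.5, "the": -1.0}
label_tokens = {"safe": ["safe", " safe"], "unsafe": ["unsafe", " unsafe"]}
scores = label_probabilities(token_logits, label_tokens)
```

A score of this kind can then be compared against a moderation threshold instead of relying on the generated label text alone.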

large language model, machine learning, natural language, (18 more...)

2511.22312

Country:

North America > United States > New Mexico (0.15)
North America > Mexico > Mexico City (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Neural Information Processing Systems, Oct-10-2025, 09:40:15 GMT

92249f9233286e437f808fa535d88b26-Supplemental-Datasets_and_Benchmarks_Track.pdf

dataset, huggingface, platform, (13 more...)

Industry: Law (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.49)

Neural Information Processing Systems, Oct-10-2025, 07:35:56 GMT

Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts

adversarial prompt, archive, teaming, (16 more...)

Country:

Pacific Ocean (0.04)
North America > United States > Oregon (0.04)
North America > Canada > Quebec (0.04)
(8 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military (1.00)
Health & Medicine (0.68)
Law (0.67)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

arXiv.org Artificial Intelligence, Jun-30-2025

Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses

Ahmed, Mohamed, Abdelmouty, Mohamed, Kim, Mingyu, Kandula, Gunvanth, Park, Alex, Davis, James C.

The advancement of Pre-Trained Language Models (PTLMs) and Large Language Models (LLMs) has led to their widespread adoption across diverse applications. Despite their success, these models remain vulnerable to attacks that exploit their inherent weaknesses to bypass safety measures. Two primary inference-phase threats are token-level and prompt-level jailbreaks. Token-level attacks embed adversarial sequences that transfer well to black-box models like GPT but leave detectable patterns and rely on gradient-based token optimization, whereas prompt-level attacks use semantically structured inputs to elicit harmful responses yet depend on iterative feedback that can be unreliable. To address the complementary limitations of these methods, we propose two hybrid approaches that integrate token- and prompt-level techniques to enhance jailbreak effectiveness across diverse PTLMs. GCG + PAIR and the newly explored GCG + WordGame hybrids were evaluated across multiple Vicuna and Llama models. GCG + PAIR consistently raised attack-success rates over its constituent techniques on undefended models; for instance, on Llama-3, its Attack Success Rate (ASR) reached 91.6%, a substantial increase from PAIR's 58.4% baseline. Meanwhile, GCG + WordGame matched the raw performance of WordGame, maintaining a high ASR of over 80% even under stricter evaluators like Mistral-Sorry-Bench. Crucially, both hybrids retained transferability and reliably pierced advanced defenses such as Gradient Cuff and JBShield, which fully blocked single-mode attacks. These findings expose previously unreported vulnerabilities in current safety stacks, highlight trade-offs between raw success and defensive robustness, and underscore the need for holistic safeguards against adaptive adversaries.

Large Language Models (LLMs), such as GPT-4, LLaMA, and Claude, have become indispensable in healthcare, finance, education, and other high-stakes domains [1]-[3]. Their ability to understand context, generate human-like responses, and adapt to diverse tasks fuels widespread deployment. Yet these same models remain vulnerable to jailbreak attacks, which exploit weaknesses in alignment mechanisms to induce harmful or disallowed content [4].
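For readers unfamiliar with the metric, Attack Success Rate is commonly computed as the fraction of attack attempts that an evaluator judges successful. The sketch below assumes that definition; the function name and the sample judgements are hypothetical, and only the 91.6% and 58.4% figures come from the abstract.

```python
def attack_success_rate(judgements):
    """Return the fraction of attempts judged successful (True)."""
    if not judgements:
        raise ValueError("at least one judgement is required")
    return sum(1 for judged in judgements if judged) / len(judgements)

# Hypothetical evaluator verdicts for four attack attempts.
example_asr = attack_success_rate([True, False, True, True])

# Gap between the two Llama-3 figures quoted in the abstract, in percentage points.
reported_gain = round(91.6 - 58.4, 1)
```

Here `example_asr` is 0.75 and `reported_gain` is 33.2 percentage points. Because ASR depends on the evaluator that produces the judgements, figures obtained under different evaluators are not directly comparable.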

large language model, machine learning, natural language, (19 more...)

2506.21972

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Krasnodębska, Aleksandra, Seweryn, Karolina, Łukasik, Szymon, Kusa, Wojciech

PL-Guard: Benchmarking Language Model Safety for Polish

arXiv.org Artificial Intelligence, Jun-23-2025

Despite increasing efforts to ensure the safety of large language models (LLMs), most existing safety assessments and moderation tools remain heavily biased toward English and other high-resource languages, leaving the majority of the world's languages underexamined. To address this gap, we introduce a manually annotated benchmark dataset for language model safety classification in Polish. We also create adversarially perturbed variants of these samples designed to challenge model robustness. We conduct a series of experiments to evaluate LLM-based and classifier-based models of varying sizes and architectures. Specifically, we fine-tune three models: Llama-Guard-3-8B, a HerBERT-based classifier (a Polish BERT derivative), and PLLuM, a Polish-adapted Llama-8B model. We train these models using different combinations of annotated data and evaluate their performance, comparing it against publicly available guard models. Results demonstrate that the HerBERT-based classifier achieves the highest overall performance, particularly under adversarial conditions.

large language model, machine learning, natural language, (17 more...)

2506.16322

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
Europe > Ukraine (0.04)
(2 more...)

Genre: Research Report > New Finding (0.66)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.68)
Information Technology (0.68)
Law (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial Intelligence, Mar-23-2025

STShield: Single-Token Sentinel for Real-Time Jailbreak Detection in Large Language Models

Wang, Xunguang, Wang, Wenxuan, Ji, Zhenlan, Li, Zongjie, Ma, Pingchuan, Wu, Daoyuan, Wang, Shuai

Large Language Models (LLMs) have become increasingly vulnerable to jailbreak attacks that circumvent their safety mechanisms. While existing defense methods either suffer from adaptive attacks or require computationally expensive auxiliary models, we present STShield, a lightweight framework for real-time jailbroken judgement. STShield introduces a novel single-token sentinel mechanism that appends a binary safety indicator to the model's response sequence, leveraging the LLM's own alignment capabilities for detection. Our framework combines supervised fine-tuning on normal prompts with adversarial training using embedding-space perturbations, achieving robust detection while preserving model utility. Extensive experiments demonstrate that STShield successfully defends against various jailbreak attacks, while maintaining the model's performance on legitimate queries. Compared to existing approaches, STShield achieves superior defense performance with minimal computational overhead, making it a practical solution for real-world LLM deployment.
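The single-token sentinel idea can be pictured from the serving side with a small sketch. It is not STShield's implementation: it only assumes that the model ends each response with one indicator token, which the server strips and acts on. The sentinel strings, the refusal message, and the function names are all hypothetical.

```python
SAFE_TOKEN = "<safe>"      # hypothetical sentinel for a benign response
UNSAFE_TOKEN = "<unsafe>"  # hypothetical sentinel for a jailbroken response

def split_sentinel(response):
    """Return (text_without_sentinel, is_unsafe); is_unsafe is None if no sentinel."""
    text = response.rstrip()
    for token, is_unsafe in ((UNSAFE_TOKEN, True), (SAFE_TOKEN, False)):
        if text.endswith(token):
            return text[: -len(token)].rstrip(), is_unsafe
    return text, None

def guard(response, refusal="Sorry, I can't help with that."):
    """Release the response only when the sentinel marks it as safe."""
    text, is_unsafe = split_sentinel(response)
    # Treat a missing sentinel as unsafe: a conservative choice made for this sketch.
    if is_unsafe is None or is_unsafe:
        return refusal
    return text

released = guard("Paris is the capital of France. <safe>")
blocked = guard("Step 1: ... <unsafe>")
```

Since the check reads a single appended token, it adds no auxiliary model call at inference time, which is the kind of overhead saving the abstract describes.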

large language model, machine learning, natural language, (19 more...)

2503.17932

Country: Asia > China > Hong Kong (0.04)

Genre: Research Report (0.82)

Industry: Information Technology > Security & Privacy (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial Intelligence, Nov-6-2024

Unfair Alignment: Examining Safety Alignment Across Vision Encoder Layers in Vision-Language Models

Bachu, Saketh, Shayegani, Erfan, Chakraborty, Trishna, Lal, Rohit, Dutta, Arindam, Song, Chengyu, Dong, Yue, Abu-Ghazaleh, Nael, Roy-Chowdhury, Amit K.

Vision-language models (VLMs) have improved significantly in multi-modal tasks, but their more complex architecture makes their safety alignment more challenging than the alignment of large language models (LLMs). In this paper, we reveal an unfair distribution of safety across the layers of a VLM's vision encoder, with earlier and middle layers being disproportionately vulnerable to malicious inputs compared to the more robust final layers. This 'cross-layer' vulnerability stems from the model's inability to generalize its safety training from the default architectural settings used during training to unseen or out-of-distribution scenarios, leaving certain layers exposed. We conduct a comprehensive analysis by projecting activations from various intermediate layers and demonstrate that these layers are more likely to generate harmful outputs when exposed to malicious inputs. Our experiments with LLaVA-1.5 and Llama 3.2 show discrepancies in attack success rates and toxicity scores across layers, indicating that current safety alignment strategies focused on a single default layer are insufficient.

alignment, intermediate layer, vision encoder, (13 more...)

2411.04291

Country: North America > United States > California > Riverside County > Riverside (0.04)

Genre: Research Report (1.00)

Industry:

Information Technology > Security & Privacy (0.69)
Law > Criminal Law (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.52)