llama guard
Online Shift Detection and Conformal Adaptation for Deployed Safety Classifiers
Safety classifiers deployed in production operate under a stationarity assumption that fails silently: when input distributions drift, accuracy degrades with no error signal until ground-truth labels arrive. We present an online monitor that detects distributional shift in classifier scores via a sliding-window KS statistic with empirically calibrated alarm thresholds. In a pre-registered factorial evaluation (4 classifiers $\times$ 5 shift conditions $\times$ 20 seeds $\times$ 2 window sizes; 800 cells), the monitor achieves 86.6% valid detection (mean latency 39.5 steps) across synthetic-onset, real-jailbreak, and adversarial regimes; a classifier $\times$ shift interaction ($ฮท^2 = 0.185$) shows that monitoring must be tuned per classifier. Attempting to recover post-detection coverage via weighted conformal prediction exposes a failure mode: density-ratio estimation collapses for generative classifiers because logistic regression separates source from target perfectly in 3584-4096-dimensional embedding space, clipping all importance weights to zero; projecting to $\leq 32$ dimensions restores coverage. We then extend the framework to gradient-based evasion and give the first threat-model characterisation of score-disagreement monitoring as a canary. We falsify three assumptions: that architectural diversity drives the signal (false, $ฮท^2 = 0.011$), that it is generic out-of-distribution detection (false, GCG-specific, $p < 10^{-12}$), and that an adaptive attacker can suppress it (false while the canary is confident). We derive the exact security boundary, a confidence-gated equilibrium at which a monitor-aware attacker stalls at gap $= 1/(2ฮป)$, and provide a calibration-free scan martingale achieving false-alarm rate $\leq 1\%$ across all classifiers with no per-model tuning.
Supplemental Material For GenAI Arena Dongfu Jiang Max Ku Tianle Li
For what purpose was the dataset created? To foster the research in aligning diffusion models further and analyze the user preferences. Who created the dataset (e.g., which team, research group) and on behalf of which entity Who funded the creation of the dataset? What do the instances that comprise the dataset represent (e.g., documents, photos, people, How many instances are there in total (of each type, if appropriate)? What data does each instance consist of?
Token-Level Marginalization for Multi-Label LLM Classifiers
Praharaj, Anjaneya, Kasundra, Jaykumar
This paper addresses the critical challenge of deriving interpretable confidence scores from generative language models (LLMs) when applied to multi-label content safety classification. While models like LLaMA Guard are effective for identifying unsafe content and its categories, their generative architecture inherently lacks direct class-level probabilities, which hinders model confidence assessment and performance interpretation. This limitation complicates the setting of dynamic thresholds for content moderation and impedes fine-grained error analysis. This research proposes and evaluates three novel token-level probability estimation approaches to bridge this gap. The aim is to enhance model interpretability and accuracy, and evaluate the generalizability of this framework across different instruction-tuned models. Through extensive experimentation on a synthetically generated, rigorously annotated dataset, it is demonstrated that leveraging token logits significantly improves the interpretability and reliability of generative classifiers, enabling more nuanced content safety moderation.
Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses
Ahmed, Mohamed, Abdelmouty, Mohamed, Kim, Mingyu, Kandula, Gunvanth, Park, Alex, Davis, James C.
--The advancement of Pre-Trained Language Models (PTLMs) and Large Language Models (LLMs) has led to their widespread adoption across diverse applications. Despite their success, these models remain vulnerable to attacks that exploit their inherent weaknesses to bypass safety measures. Two primary inference-phase threats are token-level and prompt-level jailbreaks. T oken-level attacks embed adversarial sequences that transfer well to black-box models like GPT but leave detectable patterns and rely on gradient-based token optimization, whereas prompt-level attacks use semantically structured inputs to elicit harmful responses yet depend on iterative feedback that can be unreliable. T o address the complementary limitations of these methods, we propose two hybrid approaches that integrate token-and prompt-level techniques to enhance jailbreak effectiveness across diverse PTLMs. GCG + PAIR and the newly explored GCG + WordGame hybrids were evaluated across multiple Vicuna and Llama models. GCG + PAIR consistently raised attack-success rates over its constituent techniques on undefended models; for instance, on Llama-3, its Attack Success Rate (ASR) reached 91.6%, a substantial increase from PAIR's 58.4% baseline. Meanwhile, GCG + WordGame matched the raw performance of WordGame maintaining a high ASR of over 80% even under stricter evaluators like Mistral-Sorry-Bench. Crucially, both hybrids retained transferability and reliably pierced advanced defenses such as Gradient Cuff and JBShield, which fully blocked single-mode attacks. These findings expose previously unreported vulnerabilities in current safety stacks, highlight trade-offs between raw success and defensive robustness, and underscore the need for holistic safeguards against adaptive adversaries. Large Language Models (LLMs)--such as GPT -4, LLaMA, and Claude--have become indispensable in healthcare, finance, education, and other high-stakes domains [1]-[3]. Their ability to understand context, generate human-like responses, and adapt to diverse tasks fuels widespread deployment. Y et these same models remain vulnerable to jailbreak attacks, which exploit weaknesses in alignment mechanisms to induce harmful or disallowed content [4].
PL-Guard: Benchmarking Language Model Safety for Polish
Krasnodฤbska, Aleksandra, Seweryn, Karolina, ลukasik, Szymon, Kusa, Wojciech
Despite increasing efforts to ensure the safety of large language models (LLMs), most existing safety assessments and moderation tools remain heavily biased toward English and other high-resource languages, leaving majority of global languages underexamined. To address this gap, we introduce a manually annotated benchmark dataset for language model safety classification in Polish. We also create adversarially perturbed variants of these samples designed to challenge model robustness. We conduct a series of experiments to evaluate LLM-based and classifier-based models of varying sizes and architectures. Specifically, we fine-tune three models: Llama-Guard-3-8B, a HerBERT-based classifier (a Polish BERT derivative), and PLLuM, a Polish-adapted Llama-8B model. We train these models using different combinations of annotated data and evaluate their performance, comparing it against publicly available guard models. Results demonstrate that the HerBERT-based classifier achieves the highest overall performance, particularly under adversarial conditions.
STShield: Single-Token Sentinel for Real-Time Jailbreak Detection in Large Language Models
Wang, Xunguang, Wang, Wenxuan, Ji, Zhenlan, Li, Zongjie, Ma, Pingchuan, Wu, Daoyuan, Wang, Shuai
Large Language Models (LLMs) have become increasingly vulnerable to jailbreak attacks that circumvent their safety mechanisms. While existing defense methods either suffer from adaptive attacks or require computationally expensive auxiliary models, we present STShield, a lightweight framework for real-time jailbroken judgement. STShield introduces a novel single-token sentinel mechanism that appends a binary safety indicator to the model's response sequence, leveraging the LLM's own alignment capabilities for detection. Our framework combines supervised fine-tuning on normal prompts with adversarial training using embedding-space perturbations, achieving robust detection while preserving model utility. Extensive experiments demonstrate that STShield successfully defends against various jailbreak attacks, while maintaining the model's performance on legitimate queries. Compared to existing approaches, STShield achieves superior defense performance with minimal computational overhead, making it a practical solution for real-world LLM deployment.
Unfair Alignment: Examining Safety Alignment Across Vision Encoder Layers in Vision-Language Models
Bachu, Saketh, Shayegani, Erfan, Chakraborty, Trishna, Lal, Rohit, Dutta, Arindam, Song, Chengyu, Dong, Yue, Abu-Ghazaleh, Nael, Roy-Chowdhury, Amit K.
Vision-language models (VLMs) have improved significantly in multi-modal tasks, but their more complex architecture makes their safety alignment more challenging than the alignment of large language models (LLMs). In this paper, we reveal an unfair distribution of safety across the layers of VLM's vision encoder, with earlier and middle layers being disproportionately vulnerable to malicious inputs compared to the more robust final layers. This 'cross-layer' vulnerability stems from the model's inability to generalize its safety training from the default architectural settings used during training to unseen or out-of-distribution scenarios, leaving certain layers exposed. We conduct a comprehensive analysis by projecting activations from various intermediate layers and demonstrate that these layers are more likely to generate harmful outputs when exposed to malicious inputs. Our experiments with LLaVA-1.5 and Llama 3.2 show discrepancies in attack success rates and toxicity scores across layers, indicating that current safety alignment strategies focused on a single default layer are insufficient.