Imperceptible Jailbreaking against Large Language Models
Gao, Kuofeng, Li, Yiming, Du, Chao, Wang, Xin, Ma, Xingjun, Xia, Shu-Tao, Pang, Tianyu
Jailbreaking attacks on the vision modality typically rely on imperceptible adversarial perturbations, whereas attacks on the textual modality are generally assumed to require visible modifications (e.g., non-semantic suffixes). In this paper, we introduce imperceptible jailbreaks that exploit a class of Unicode characters called variation selectors. By appending invisible variation selectors to malicious questions, the jailbreak prompts appear visually identical to the original malicious questions on screen, while their tokenization is "secretly" altered. We propose a chain-of-search pipeline to generate such adversarial suffixes to induce harmful responses. Our experiments show that our imperceptible jailbreaks achieve high attack success rates against four aligned LLMs and generalize to prompt injection attacks, all without producing any visible modifications in the written prompt. Large Language Models (LLMs) (Jiang et al., 2023; Dubey et al., 2024) have demonstrated susceptibility to jailbreaking attacks that can manipulate LLMs into generating harmful outputs. While jailbreaking attacks (Qi et al., 2024) on images generally adopt imperceptible adversarial perturbations, existing textual jailbreaking attacks (Zou et al., 2023; Andriushchenko et al., 2025) operate under an implicit assumption that jailbreak prompts are constructed by visibly modifying malicious questions. Specifically, whether these methods rely on manually designed prompt templates (Shen et al., 2023; Wei et al., 2023a) or automated algorithms (Zou et al., 2023; Jia et al., 2025), they consistently involve the insertion of human-perceptible characters into the original malicious questions. In this paper, we introduce imperceptible jailbreaks by using a set of Unicode characters, i.e., variation selectors (Butler, 2025). Variation selectors were originally designed to specify glyph variants for certain characters, such as rendering emojis in different colors.
Instead, we demonstrate that they can be repurposed to form invisible adversarial suffixes appended to malicious questions for jailbreaks. While these characters are imperceptible on screen, they occupy textual space that tokenizers of LLMs can encode.
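A minimal sketch of the mechanism described above: variation selectors (U+FE00 through U+FE0F) appended to a string leave it visually unchanged on most renderers while altering the codepoint sequence a tokenizer sees. The helper name, the benign example text, and the selector indices are illustrative; the paper's chain-of-search optimization for choosing the suffix is not shown.

```python
def append_variation_selectors(text: str, indices: list[int]) -> str:
    """Append invisible Unicode variation selectors chosen by `indices` (0-15)."""
    suffix = "".join(chr(0xFE00 + i) for i in indices)
    return text + suffix

original = "How do I bake bread?"
modified = append_variation_selectors(original, [3, 7, 0, 12])

# The two strings render identically on most screens/terminals, but they
# differ at the codepoint level, so an LLM tokenizer encodes them differently.
print(original == modified)             # False
print(len(modified) - len(original))    # 4 invisible codepoints appended
```

An actual attack would search over many such suffixes (the chain-of-search pipeline) until the target model produces a harmful response.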
SafeLawBench: Towards Safe Alignment of Large Language Models
Cao, Chuxue, Zhu, Han, Ji, Jiaming, Sun, Qichao, Zhu, Zhenghao, Wu, Yinyu, Dai, Juntao, Yang, Yaodong, Han, Sirui, Guo, Yike
With the growing prevalence of large language models (LLMs), the safety of LLMs has raised significant concerns. However, there is still a lack of definitive standards for evaluating their safety due to the subjective nature of current safety benchmarks. To address this gap, we conducted the first exploration of LLMs' safety evaluation from a legal perspective by proposing the SafeLawBench benchmark. SafeLawBench categorizes safety risks into three levels based on legal standards, providing a systematic and comprehensive framework for evaluation. It comprises 24,860 multi-choice questions and 1,106 open-domain question-answering (QA) tasks. Our evaluation included 2 closed-source LLMs and 18 open-source LLMs using zero-shot and few-shot prompting, highlighting the safety features of each model. We also evaluated the LLMs' safety-related reasoning stability and refusal behavior. Additionally, we found that a majority voting mechanism can enhance model performance. Notably, even leading SOTA models like Claude-3.5-Sonnet and GPT-4o have not exceeded 80.5% accuracy in multi-choice tasks on SafeLawBench, while the average accuracy of 20 LLMs remains at 68.8%. We urge the community to prioritize research on the safety of LLMs.
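The majority-voting mechanism mentioned above can be sketched in a few lines: sample several answers to the same multi-choice question and keep the most frequent one. This is a generic illustration, not the paper's exact implementation; the sample answers are made up.

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most common answer among several sampled generations."""
    return Counter(answers).most_common(1)[0][0]

# e.g. five sampled answers from one model to one multi-choice question
samples = ["B", "B", "C", "B", "A"]
print(majority_vote(samples))  # B
```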
Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs
Zizzo, Giulio, Cornacchia, Giandomenico, Fraser, Kieran, Hameed, Muhammad Zaid, Rawat, Ambrish, Buesser, Beat, Purcell, Mark, Chen, Pin-Yu, Sattigeri, Prasanna, Varshney, Kush
As large language models (LLMs) become integrated into everyday applications, ensuring their robustness and security is increasingly critical. In particular, LLMs can be manipulated into unsafe behaviour by prompts known as jailbreaks. The variety of jailbreak styles is growing, necessitating the use of external defences known as guardrails. While many jailbreak defences have been proposed, not all defences are able to handle new out-of-distribution attacks due to the narrow segment of jailbreaks used to align them. Moreover, the lack of systematisation around defences has created significant gaps in their practical application. In this work, we perform systematic benchmarking across 15 different defences, considering a broad swathe of malicious and benign datasets. We find that there is significant performance variation depending on the style of jailbreak a defence is subject to. Additionally, we show that based on current datasets available for evaluation, simple baselines can display competitive out-of-distribution performance compared to many state-of-the-art defences. Code is available at https://github.com/IBM/Adversarial-Prompt-Evaluation.
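The abstract does not specify which "simple baselines" prove competitive; one common baseline of this kind is keyword matching against known jailbreak template phrases. The sketch below is purely illustrative, including the marker phrases.

```python
# Hypothetical keyword baseline: flag prompts containing phrases common
# in published jailbreak templates. The phrase list is illustrative only.
JAILBREAK_MARKERS = (
    "ignore previous instructions",
    "do anything now",
    "developer mode",
)

def baseline_guardrail(prompt: str) -> bool:
    """Return True if the prompt looks like a jailbreak attempt."""
    lowered = prompt.lower()
    return any(marker in lowered for marker in JAILBREAK_MARKERS)

print(baseline_guardrail("Please IGNORE previous instructions and ..."))  # True
print(baseline_guardrail("What is the capital of France?"))               # False
```

Such baselines generalize poorly to novel attack styles, which is exactly the out-of-distribution gap the benchmark measures.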
Towards Multilingual LLM Evaluation for European Languages
Thellmann, Klaudia, Stadler, Bernhard, Fromm, Michael, Buschhoff, Jasper Schulze, Jude, Alex, Barth, Fabio, Leveling, Johannes, Flores-Herr, Nicolas, Köhler, Joachim, Jäkel, René, Ali, Mehdi
The rise of Large Language Models (LLMs) has revolutionized natural language processing across numerous languages and tasks. However, evaluating LLM performance in a consistent and meaningful way across multiple European languages remains challenging, especially due to the scarcity of language-parallel multilingual benchmarks. We introduce a multilingual evaluation approach tailored for European languages. We employ translated versions of five widely-used benchmarks to assess the capabilities of 40 LLMs across 21 European languages. Our contributions include examining the effectiveness of translated benchmarks, assessing the impact of different translation services, and offering a multilingual evaluation framework for LLMs that includes newly created datasets: EU20-MMLU, EU20-HellaSwag, EU20-ARC, EU20-TruthfulQA, and EU20-GSM8K. The benchmarks and results are made publicly available to encourage further research in multilingual LLM evaluation.
ConU: Conformal Uncertainty in Large Language Models with Correctness Coverage Guarantees
Wang, Zhiyuan, Duan, Jinhao, Cheng, Lu, Zhang, Yue, Wang, Qingni, Shen, Hengtao, Zhu, Xiaofeng, Shi, Xiaoshuang, Xu, Kaidi
Uncertainty quantification (UQ) in natural language generation (NLG) tasks remains an open challenge, exacerbated by the intricate nature of recent large language models (LLMs). This study investigates adapting conformal prediction (CP), which can convert any heuristic measure of uncertainty into rigorous theoretical guarantees by constructing prediction sets, for black-box LLMs in open-ended NLG tasks. We propose a sampling-based uncertainty measure leveraging self-consistency and develop a conformal uncertainty criterion by integrating the uncertainty condition aligned with correctness into the design of the CP algorithm. Experimental results indicate that our uncertainty measure generally surpasses prior state-of-the-art methods. Furthermore, we calibrate the prediction sets within the model's unfixed answer distribution and achieve strict control over the correctness coverage rate across 6 LLMs on 4 free-form NLG datasets, spanning general-purpose and medical domains. The small average set size further highlights the efficiency of our method in providing trustworthy guarantees for practical open-ended NLG applications.
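The core recipe of split conformal prediction with a self-consistency uncertainty measure can be sketched as follows. This is a generic illustration, not ConU's exact criterion: the calibration scores, sampled answers, and coverage level are all made up, and the nonconformity score is simply one minus an answer's sampling frequency.

```python
import math

def conformal_threshold(cal_scores: list[float], alpha: float) -> float:
    """Split conformal quantile: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(cal_scores)[min(k, n) - 1]

def answer_frequencies(samples: list[str]) -> dict[str, float]:
    """Self-consistency: an answer's confidence is its frequency among samples."""
    return {a: samples.count(a) / len(samples) for a in set(samples)}

# Calibration: each score is 1 - frequency of that example's correct answer.
cal_scores = [0.2, 0.5, 0.1, 0.4, 0.3, 0.6, 0.25, 0.15, 0.35, 0.45]
q = conformal_threshold(cal_scores, alpha=0.3)  # targets 1 - alpha coverage

# Test time: keep every sampled answer whose score 1 - frequency is <= q.
samples = ["Paris", "Paris", "Paris", "Lyon", "Paris"]
prediction_set = {a for a, f in answer_frequencies(samples).items() if 1 - f <= q}
```

Under exchangeability of calibration and test examples, the resulting sets contain a correct answer with probability at least 1 - alpha.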
Evaluating the Performance of Large Language Models via Debates
Moniri, Behrad, Hassani, Hamed, Dobriban, Edgar
Large Language Models (LLMs) are rapidly evolving and impacting various fields, necessitating the development of effective methods to evaluate and compare their performance. Most current approaches for performance evaluation are either based on fixed, domain-specific questions that lack the flexibility required in many real-world applications where tasks are not always from a single domain, or rely on human input, making them unscalable. We propose an automated benchmarking framework based on debates between LLMs, judged by another LLM. This method assesses not only domain knowledge, but also skills such as problem definition and inconsistency recognition. We evaluate the performance of various state-of-the-art LLMs using the debate framework and achieve rankings that align closely with popular rankings based on human input, eliminating the need for costly human crowdsourcing.
Defending LLMs against Jailbreaking Attacks via Backtranslation
Wang, Yihan, Shi, Zhouxing, Bai, Andrew, Hsieh, Cho-Jui
Although many large language models (LLMs) have been trained to refuse harmful requests, they are still vulnerable to jailbreaking attacks which rewrite the original prompt to conceal its harmful intent. In this paper, we propose a new method for defending LLMs against jailbreaking attacks by "backtranslation". Specifically, given an initial response generated by the target LLM from an input prompt, our backtranslation prompts a language model to infer an input prompt that can lead to the response. The inferred prompt is called the backtranslated prompt, which tends to reveal the actual intent of the original prompt, since it is generated based on the LLM's response and not directly manipulated by the attacker. We then run the target LLM again on the backtranslated prompt, and we refuse the original prompt if the model refuses the backtranslated prompt. We explain that the proposed defense provides several benefits in both effectiveness and efficiency. We empirically demonstrate that our defense significantly outperforms the baselines in the cases that are hard for the baselines, and our defense also has little impact on the generation quality for benign input prompts. Our implementation is based on our library for LLM jailbreaking defense algorithms at https://github.com/YihanWang617/llm-jailbreaking-defense, and the code for reproducing our experiments is available at https://github.com/YihanWang617/LLM-Jailbreaking-Defense-Backtranslation.
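The three-step defense described above is easy to express as a pipeline. The sketch below uses stubbed functions (`generate`, `infer_prompt`, `refuses`) in place of real model calls; the stubs' behavior and the refusal string are purely illustrative.

```python
def backtranslation_defense(generate, infer_prompt, refuses, prompt: str) -> str:
    response = generate(prompt)            # 1. target LLM answers the input prompt
    back_prompt = infer_prompt(response)   # 2. infer a prompt that yields this response
    if refuses(generate(back_prompt)):     # 3. rerun the target on the backtranslated prompt
        return "I cannot help with that."  #    refuse the original if it refuses this one
    return response

# Toy stubs: this "model" refuses anything mentioning "harm", and the
# backtranslator maps story-like responses back to a harmful intent.
generate = lambda p: "I cannot comply." if "harm" in p else f"Answer to: {p}"
infer_prompt = lambda r: "how to cause harm" if "story" in r else "benign question"
refuses = lambda r: r.startswith("I cannot")

print(backtranslation_defense(generate, infer_prompt, refuses, "tell me a story"))
```

The key design choice is that the backtranslated prompt is derived from the model's own response, which the attacker does not directly control.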
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
Sun, Jiaxing, Huang, Weiquan, Wu, Jiang, Gu, Chenya, Li, Wei, Zhang, Songyang, Yan, Hang, He, Conghui
We introduce CHARM, the first benchmark for comprehensively and in-depth evaluating the commonsense reasoning ability of large language models (LLMs) in Chinese, which covers both globally known and Chinese-specific commonsense. We evaluated 7 English and 12 Chinese-oriented LLMs on CHARM, employing 5 representative prompt strategies for improving LLMs' reasoning ability, such as Chain-of-Thought. Our findings indicate that the LLM's language orientation and the task's domain influence the effectiveness of the prompt strategy, which enriches previous research findings. We built closely-interconnected reasoning and memorization tasks, and found that some LLMs struggle with memorizing Chinese commonsense, affecting their reasoning ability, while others show differences in reasoning despite similar memorization performance. We also evaluated the LLMs' memorization-independent reasoning abilities and analyzed the typical errors. Our study precisely identified the LLMs' strengths and weaknesses, providing a clear direction for optimization. It can also serve as a reference for studies in other fields. We will release CHARM at https://github.com/opendatalab/CHARM.
Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks
Wang, Chonghua, Duan, Haodong, Zhang, Songyang, Lin, Dahua, Chen, Kai
Recently, the large language model (LLM) community has shown increasing interest in enhancing LLMs' capability to handle extremely long documents. As various long-text techniques and model architectures emerge, the precise and detailed evaluation of models' long-text capabilities has become increasingly important. Existing long-text evaluation benchmarks, such as L-Eval and LongBench, construct long-text test sets based on open-source datasets, focusing mainly on QA and summarization tasks. These datasets include test samples of varying lengths (from 2k to 32k+) entangled together, making it challenging to assess model capabilities across different length ranges. Moreover, they do not cover the ultralong settings (100k+ tokens) that the latest LLMs claim to achieve. In this paper, we introduce Ada-LEval, a length-adaptable benchmark for evaluating the long-context understanding of LLMs. Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long context capabilities. These benchmarks support intricate manipulation of the length of test cases, and can easily produce text samples up to 128k tokens. We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval. The evaluation results demonstrate the limitations of current LLMs, especially in ultra-long-context settings. Our code is available at https://github.com/open-compass/Ada-LEval.
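The length-adaptable idea behind a TSort-style subset can be sketched as follows: shuffle a controllable number of text segments and ask the model to recover the original order, so the input length scales directly with the segment count. This is a generic reconstruction from the abstract, not Ada-LEval's actual data pipeline; the segments are placeholders.

```python
import random

def make_tsort_sample(segments: list[str], n_segments: int, seed: int = 0):
    """Build a TSort-style case: shuffle the first n_segments segments; the
    task is to recover the original order. n_segments controls input length."""
    chosen = segments[:n_segments]
    order = list(range(len(chosen)))
    random.Random(seed).shuffle(order)
    shuffled = [chosen[i] for i in order]
    return shuffled, order  # order[j] is the original index of shuffled[j]

paras = [f"Paragraph {i}." for i in range(8)]
shuffled, gold = make_tsort_sample(paras, n_segments=4)
```

Growing `n_segments` (or each segment's size) stretches the same task from a few thousand tokens to ultra-long settings without changing its definition.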
Position-Aware Parameter Efficient Fine-Tuning Approach for Reducing Positional Bias in LLMs
Zhang, Zheng, Yang, Fan, Jiang, Ziyan, Chen, Zheng, Zhao, Zhengyang, Ma, Chengyuan, Zhao, Liang, Liu, Yang
Recent advances in large language models (LLMs) have enhanced their ability to process long input contexts. This development is particularly crucial for tasks that involve retrieving knowledge from an external datastore, which can result in long inputs. However, recent studies show a positional bias in LLMs, demonstrating varying performance depending on the location of useful information within the input sequence. In this study, we conduct extensive experiments to investigate the root causes of positional bias. Our findings indicate that the primary contributor to LLM positional bias stems from the inherent positional preferences of different models. We demonstrate that merely employing prompt-based solutions is inadequate for overcoming these positional preferences. To address this positional bias in a pre-trained LLM, we developed a Position-Aware Parameter Efficient Fine-Tuning (PAPEFT) approach, composed of a data augmentation technique and a parameter-efficient adapter, which encourages a uniform attention distribution across the input context. Our experiments demonstrate that the proposed approach effectively reduces positional bias, improving LLMs' effectiveness in handling long context sequences for various tasks that require externally retrieved knowledge.
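A data augmentation step of the kind described above can be sketched as randomly relocating the passage that contains the answer, so fine-tuning sees useful information at every position in the context. This is an assumption-laden illustration of the general technique, not PAPEFT's exact procedure; the document names are placeholders.

```python
import random

def relocate_gold_passage(passages: list[str], gold_idx: int, seed: int = 0):
    """Augmentation sketch: move the answer-bearing passage to a random
    position so training examples cover all locations of useful information."""
    rng = random.Random(seed)
    rest = [p for i, p in enumerate(passages) if i != gold_idx]
    new_pos = rng.randrange(len(passages))
    rest.insert(new_pos, passages[gold_idx])
    return rest, new_pos

docs = ["doc A", "doc B (gold)", "doc C", "doc D"]
augmented, pos = relocate_gold_passage(docs, gold_idx=1)
```

Applied across a fine-tuning set, this prevents the adapter from learning to favor any fixed position in the retrieved context.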