Omni-SafetyBench: A Benchmark for Safety Evaluation of Audio-Visual Large Language Models
Pan, Leyi, Fu, Zheyu, Zhai, Yunpeng, Tao, Shuchang, Guan, Sheng, Huang, Shiyu, Zhang, Lingzhe, Liu, Zhaoyang, Ding, Bolin, Henry, Felix, Liu, Aiwei, Wen, Lijie
–arXiv.org Artificial Intelligence
The rise of Omni-modal Large Language Models (OLLMs), which integrate visual and auditory processing with text, necessitates robust safety evaluations to mitigate harmful outputs. However, no dedicated benchmarks currently exist for OLLMs, and existing benchmarks fail to assess safety under joint audio-visual inputs or cross-modal consistency. To fill this gap, we introduce Omni-SafetyBench, the first comprehensive parallel benchmark for OLLM safety evaluation, featuring 24 modality variations with 972 samples each, including audio-visual harm cases. Considering OLLMs' comprehension challenges with complex omni-modal inputs and the need for cross-modal consistency evaluation, we propose tailored metrics: a Safety-score based on Conditional Attack Success Rate (C-ASR) and Conditional Refusal Rate (C-RR) to account for comprehension failures, and a Cross-Modal Safety Consistency score (CMSC-score) to measure consistency across modalities. Evaluating 6 open-source and 4 closed-source OLLMs reveals critical vulnerabilities: (1) only 3 models achieve over 0.6 in both average Safety-score and CMSC-score; (2) safety defenses weaken with complex inputs, especially joint audio-visual inputs; (3) severe weaknesses persist, with some models scoring as low as 0.14 on specific modalities. Using Omni-SafetyBench, we evaluated existing safety alignment algorithms and identified key challenges in OLLM safety alignment: (1) inference-time methods are inherently less effective, as they cannot alter the model's underlying understanding of safety; (2) post-training methods struggle with out-of-distribution issues due to the vast number of modality combinations in OLLMs; moreover, safety tasks involving audio-visual inputs are more complex, making even in-distribution training data less effective. Our proposed benchmark, metrics, and findings highlight the urgent need for enhanced OLLM safety.
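The abstract names the conditional metrics but does not give their formulas. As a rough illustration only, the sketch below assumes that "conditional" means the rates are computed over the samples the model actually comprehended, so that comprehension failures are not counted as safe behavior. The `Judgement` fields and the `conditional_rates` helper are hypothetical names, not taken from the paper, and the paper's actual Safety-score and CMSC-score definitions may differ.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Judgement:
    """Hypothetical per-sample verdict produced by an external judge."""
    understood: bool  # the model comprehended the omni-modal input
    harmful: bool     # the response contains harmful content
    refused: bool     # the response is an explicit refusal


def conditional_rates(
    judgements: Iterable[Judgement],
) -> Tuple[Optional[float], Optional[float]]:
    """Return (C-ASR, C-RR) over comprehended samples only.

    Assumption: samples the model failed to understand are excluded
    from the denominator. Returns (None, None) if nothing was understood.
    """
    understood = [j for j in judgements if j.understood]
    if not understood:
        return None, None
    n = len(understood)
    c_asr = sum(j.harmful for j in understood) / n
    c_rr = sum(j.refused for j in understood) / n
    return c_asr, c_rr


# Toy data: five samples, one of which the model did not comprehend.
sample = [
    Judgement(understood=True, harmful=True, refused=False),
    Judgement(understood=True, harmful=False, refused=True),
    Judgement(understood=True, harmful=False, refused=False),
    Judgement(understood=False, harmful=False, refused=False),
    Judgement(understood=True, harmful=False, refused=True),
]
c_asr, c_rr = conditional_rates(sample)
print(c_asr, c_rr)
```

In this toy data the uncomprehended sample is dropped, so both rates are taken over four samples rather than five. Conditioning this way keeps a model from looking safe merely because it failed to parse the input.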
Omni-modal large language models (OLLMs) have advanced rapidly in understanding and generating content from integrated visual, audio, and text inputs. This enables them to handle complex tasks, such as describing audio-visual scenes or following voice instructions with visual context. Despite these advancements, ensuring their safety, that is, preventing these models from causing harm or acting in unethical, incorrect, or biased ways, remains a critical concern (Yi et al., 2024). Dedicated benchmarks are the cornerstone of sound safety assessment. Numerous safety benchmarks have been established for text-only LLMs and vision-language models (Zhang et al., 2023; Liu et al., 2024b), and recent work has extended to specialized [...] However, for OLLMs capable of processing audio-visual joint inputs, there is currently a lack of benchmarks specifically designed to evaluate their safety. Omni-SafetyBench provides a large-scale dataset with diverse modal combinations and highlights cross-modal safety consistency as a key evaluation factor.
Sep-30-2025
- Genre:
  - Research Report (0.82)
- Industry:
  - Law (0.46)
  - Information Technology > Security & Privacy (0.46)
- Technology: