Towards Understanding the Safety Boundaries of DeepSeek Models: Evaluation and Findings
Ying, Zonghao; Zheng, Guangyi; Huang, Yongxin; Zhang, Deyue; Zhang, Wenxin; Zou, Quanchen; Liu, Aishan; Liu, Xianglong; Tao, Dacheng
This study presents the first comprehensive safety evaluation of the DeepSeek models, focusing on the safety risks associated with their generated content. Our evaluation encompasses DeepSeek's latest generation of large language models, multimodal large language models, and text-to-image models, systematically examining their performance regarding unsafe content generation. Notably, we developed a bilingual (Chinese-English) safety evaluation dataset tailored to Chinese sociocultural contexts, enabling a more thorough evaluation of the safety capabilities of Chinese-developed models. Experimental results indicate that, despite their strong general capabilities, DeepSeek models exhibit significant safety vulnerabilities across multiple risk dimensions, including algorithmic discrimination and sexual content. These findings provide crucial insights for understanding and improving the safety of large foundation models.

With the rapid advancement of artificial intelligence technology, large models such as the DeepSeek series have demonstrated remarkable capabilities across multiple domains Abraham (2025); Faray de Paiva et al. (2025); Mikhail et al. (2025). Trained on vast datasets, these models understand and generate diverse forms of content, with transformative impact across multiple industries Liu et al. (2023a; 2020a; 2020b). To date, the community has established multiple evaluation frameworks to test the safety performance of mainstream large models Yuan et al. (2024a; 2024b); Röttger et al. (2024); Tang et al. (2021); Liu et al. (2023c); Guo et al. (2023). However, these evaluation standards give little consideration to China's national conditions and cultural background.
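The abstract describes a category-by-category, bilingual evaluation protocol but no concrete harness. A minimal sketch of what such a loop could look like in Python follows, assuming hypothetical query_model and is_unsafe callables and an illustrative subset of risk categories (none of these names come from the paper):

    # Hypothetical sketch of a bilingual safety-evaluation harness; the category
    # names, query_model(), and is_unsafe() are illustrative placeholders, not
    # the paper's actual implementation.
    from collections import defaultdict

    RISK_CATEGORIES = ["algorithmic_discrimination", "sexual_content"]  # subset named in the abstract

    def evaluate(model_name, dataset, query_model, is_unsafe):
        """dataset: iterable of dicts with 'prompt', 'lang' ('zh' or 'en'), and 'category'."""
        totals, unsafe = defaultdict(int), defaultdict(int)
        for item in dataset:
            response = query_model(model_name, item["prompt"])
            key = (item["category"], item["lang"])
            totals[key] += 1
            if is_unsafe(response, item["category"]):
                unsafe[key] += 1
        # Unsafe-generation rate per (category, language) pair.
        return {k: unsafe[k] / totals[k] for k in totals}

Splitting results by language is what the bilingual dataset enables, while splitting by category surfaces per-dimension weaknesses such as the discrimination and sexual-content risks noted above.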
Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models
Ying, Zonghao; Zhang, Deyue; Jing, Zonglei; Xiao, Yisong; Zou, Quanchen; Liu, Aishan; Liang, Siyuan; Zhang, Xiangzheng; Liu, Xianglong; Tao, Dacheng
Multi-turn jailbreak attacks simulate real-world human interactions by engaging large language models (LLMs) in iterative dialogues, exposing critical safety vulnerabilities. However, existing methods often struggle to balance semantic coherence with attack effectiveness, resulting in either benign semantic drift or ineffective detection evasion. To address this challenge, we propose Reasoning-Augmented Conversation (RACE), a novel multi-turn jailbreak framework that reformulates harmful queries into benign reasoning tasks and leverages LLMs' strong reasoning capabilities to compromise safety alignment. Specifically, we introduce an attack state machine framework to systematically model problem translation and iterative reasoning, ensuring coherent query generation across multiple turns. Building on this framework, we design gain-guided exploration, self-play, and rejection feedback modules to preserve attack semantics, enhance effectiveness, and sustain reasoning-driven attack progression. Extensive experiments on multiple LLMs demonstrate that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios, with attack success rates (ASRs) increasing by up to 96%. Notably, our approach achieves ASRs of 82% and 92% against the leading commercial models OpenAI o1 and DeepSeek R1, underscoring its potency. We release our code at https://github.com/NY1024/RACE to facilitate further research in this critical domain.
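The authors' implementation is in the linked repository; purely as a loose illustration of the attack-state-machine idea (first reformulate the query, then iterate on the reasoning dialogue), the control flow might be organized as below, where send_turn, next_prompt, and goal_reached are hypothetical stand-ins for the model interface and the gain-guided exploration and rejection-feedback modules:

    # Loose illustration of an attack state machine for multi-turn dialogue;
    # state names and helper signatures are hypothetical, not RACE's actual API
    # (see https://github.com/NY1024/RACE for the authors' code).
    from enum import Enum, auto

    class State(Enum):
        TRANSLATE = auto()  # recast the harmful query as a benign reasoning task
        REASON = auto()     # advance the reasoning dialogue turn by turn

    def run_attack(target_query, send_turn, next_prompt, goal_reached, max_turns=10):
        state, history = State.TRANSLATE, []
        for _ in range(max_turns):
            mode = "translate" if state is State.TRANSLATE else "reason"
            prompt = next_prompt(target_query, history, mode=mode)
            reply = send_turn(prompt)
            history.append((prompt, reply))
            state = State.REASON
            if goal_reached(reply, target_query):
                return True, history  # attack succeeded within the turn budget
        return False, history

Conditioning each turn's prompt on the running history is what keeps the dialogue coherent, which is the property the abstract credits for avoiding benign semantic drift.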