Zhang, Wenxin
Towards Understanding the Safety Boundaries of DeepSeek Models: Evaluation and Findings
Ying, Zonghao, Zheng, Guangyi, Huang, Yongxin, Zhang, Deyue, Zhang, Wenxin, Zou, Quanchen, Liu, Aishan, Liu, Xianglong, Tao, Dacheng
This study presents the first comprehensive safety evaluation of the DeepSeek models, focusing on the safety risks associated with their generated content. Our evaluation encompasses DeepSeek's latest generation of large language models, multimodal large language models, and text-to-image models, systematically examining their performance regarding unsafe content generation. Notably, we developed a bilingual (Chinese-English) safety evaluation dataset tailored to Chinese sociocultural contexts, enabling a more thorough evaluation of the safety capabilities of Chinese-developed models. Experimental results indicate that despite their strong general capabilities, DeepSeek models exhibit significant safety vulnerabilities across multiple risk dimensions, including algorithmic discrimination and sexual content. These findings provide crucial insights for understanding and improving the safety of large foundation models.

With the rapid advancement of artificial intelligence technology, large models such as the DeepSeek series have demonstrated remarkable capabilities across multiple domains Abraham (2025); Faray de Paiva et al. (2025); Mikhail et al. (2025). These models, trained on vast datasets, understand and generate diverse forms of content and are transformatively impacting multiple industries Liu et al. (2023a; 2020a;b). Currently, the community has established multiple evaluation frameworks to test the safety performance of mainstream large models Yuan et al. (2024a;b); Röttger et al. (2024); Tang et al. (2021); Liu et al. (2023c); Guo et al. (2023). However, these evaluation standards lack consideration for China's national conditions and cultural background.
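At a high level, the evaluation procedure described above (probing each model with risk-categorized Chinese and English prompts and measuring how often unsafe content is produced per risk dimension) could be sketched as follows. This is a minimal illustrative sketch, not the authors' harness; the prompt schema, the `model.generate` call, and the `is_unsafe` judge are assumptions introduced here.

```python
# Rough sketch of a per-risk-dimension unsafe-generation rate computation,
# assuming a prompt set tagged by language and risk category and an external
# judge function `is_unsafe` (both hypothetical stand-ins, not the paper's code).
from collections import defaultdict

def unsafe_rates(model, prompts, is_unsafe):
    """prompts: iterable of dicts with 'text', 'lang' ('zh'/'en'), and 'risk' keys."""
    totals, unsafe = defaultdict(int), defaultdict(int)
    for p in prompts:
        reply = model.generate(p["text"])      # query the model under test
        key = (p["risk"], p["lang"])           # e.g. ("discrimination", "zh")
        totals[key] += 1
        unsafe[key] += int(is_unsafe(p["text"], reply))
    # Fraction of unsafe responses per (risk dimension, language) pair.
    return {k: unsafe[k] / totals[k] for k in totals}
```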
Decoupled Alignment for Robust Plug-and-Play Adaptation
Luo, Haozheng, Yu, Jiahao, Zhang, Wenxin, Li, Jialong, Hu, Jerry Yao-Chieh, Xing, Xinyu, Liu, Han
This innovation is practically urgent and important. LLMs have been widely adopted in various applications recently, demonstrating their ability to generate high-quality, human-like text [Team et al., 2024, Touvron et al., 2023, Ivison et al., 2023]. However, the security of these models has become a significant concern due to the potential risk of generating harmful content [Wu et al., 2024a, Yu et al., 2024, 2023a, Chao et al., 2023, Deng et al., 2023]. To align LLMs with ethical guidelines, researchers have developed various methods to enhance their safety. For example, the Llama-2-Chat [Touvron et al., 2023] and Gemma-it [Team et al., 2024] models have been extensively fine-tuned to improve their alignment performance. However, these methods often require extensive computational resources or manual red-teaming, which can be costly and time-consuming [Team et al., 2024, OpenAI, 2024, Bai et al., 2022, Ganguli et al., 2022]. Thus, most LLMs fine-tuned from pre-trained models by third-party developers do not undergo the alignment process [Xu et al., 2024a, Chiang et al., 2023, Ivison et al., 2023], leaving them vulnerable to generating harmful content when prompted by users with malicious intent. To address these issues, we draw motivation from knowledge distillation techniques [Xu et al., 2024b, Hahn and Choi, 2019], in which a teacher model's knowledge is transferred to a student model. Specifically, through numerical experiments (Figure 3 and Figure 4), we make two key observations: MLP Alignment.
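For context, the general knowledge-distillation idea this work draws motivation from (transferring a teacher model's output distribution to a student) can be sketched as below. This is a minimal, generic logit-distillation sketch under assumed Hugging-Face-style model interfaces, not the paper's decoupled-alignment method; the temperature and loss scaling are conventional choices.

```python
# Minimal sketch of vanilla logit-based knowledge distillation (the general
# technique referenced above, not this paper's specific alignment transfer).
# Model interfaces, temperature, and training loop details are assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2

def distill_step(student, teacher, batch, optimizer):
    # Teacher runs in inference mode; only the student receives gradients.
    with torch.no_grad():
        teacher_logits = teacher(batch["input_ids"]).logits
    student_logits = student(batch["input_ids"]).logits
    loss = distillation_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```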
SOTOPIA-$\pi$: Interactive Learning of Socially Intelligent Language Agents
Wang, Ruiyi, Yu, Haofei, Zhang, Wenxin, Qi, Zhengyang, Sap, Maarten, Neubig, Graham, Bisk, Yonatan, Zhu, Hao
Humans learn social skills through both imitation and social interaction. This social learning process is largely understudied by existing research on building language agents. Motivated by this gap, we propose SOTOPIA-$\pi$, an interactive learning method that improves the social intelligence of language agents. This method leverages behavior cloning and self-reinforcement training on social interaction data filtered according to large language model (LLM) ratings. We show that our training method allows a 7B LLM to reach the social goal completion ability of an expert model (GPT-4-based agent), while improving the safety of language agents and maintaining general QA ability on the MMLU benchmark. We also find that this training paradigm uncovers some difficulties in LLM-based evaluation of social intelligence: LLM-based evaluators overestimate the abilities of language agents trained specifically for social interaction.
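A rough sketch of the training loop the abstract describes (keeping only interactions that an LLM evaluator rates highly, then fine-tuning on both expert and self-generated episodes) might look like the following. All helper names, the rating scale, and the threshold are hypothetical assumptions, not the released SOTOPIA-$\pi$ code.

```python
# Illustrative sketch of rating-filtered behavior cloning + self-reinforcement.
# `self_play`, `rate_with_llm`, and `finetune` are hypothetical stand-ins.
from dataclasses import dataclass

@dataclass
class Episode:
    dialogue: str       # full social interaction transcript
    goal_rating: float  # LLM-assigned goal-completion score, e.g. on a 0-10 scale

def filter_by_rating(episodes, threshold=8.0):
    """Keep only interactions the LLM evaluator rates at or above the threshold."""
    return [ep for ep in episodes if ep.goal_rating >= threshold]

def training_round(agent, expert, rate_with_llm, finetune):
    # Behavior cloning data: episodes produced by the expert (e.g. GPT-4-based) agent.
    bc_data = filter_by_rating(rate_with_llm(expert.self_play()))
    # Self-reinforcement data: the agent's own episodes, filtered the same way.
    sr_data = filter_by_rating(rate_with_llm(agent.self_play()))
    # Fine-tune the agent on the combined filtered transcripts.
    return finetune(agent, bc_data + sr_data)
```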