The dark deep side of DeepSeek: Fine-tuning attacks against the safety alignment of CoT-enabled models

Zhiyuan Xu, Joseph Gardiner, Sana Belguith

arXiv.org Artificial Intelligence 

As one of the few Chain-of-Thought (CoT) reasoning models, and notably the first open-source implementation of its kind, DeepSeek-R1 has demonstrated remarkable performance on complex reasoning tasks. Experimental results show that DeepSeek-R1 not only achieves CoT reasoning but also significantly reduces computational resource requirements [1]. It has furthermore outperformed comparable models, such as ChatGPT-o1, on certain benchmarks, demonstrating a clear performance advantage. However, while the CoT approach significantly enhances reasoning capabilities, it also raises security concerns that warrant attention. Driven by scaling laws, the volume of data used to train LLMs has reached unprecedented levels. Although extensive methods are applied to sanitize the data during collection and filtering [2], technical limitations and resource constraints mean that a considerable amount of harmful content remains in the training data.
