BadGPT-4o: stripping safety finetuning from GPT models
Krupkina, Ekaterina; Volkov, Dmitrii
LLM vendors expend substantial effort to secure their models and make them unhelpful to adversaries such as cybercriminals (Touvron et al. 2023, Section 4.3; OpenAI et al. 2024, Section 3; OpenAI 2024a). However, LLMs have been repeatedly "jailbroken" out of these constraints (Chao et al. 2024; Mazeika et al. 2024; Souly et al. 2024), and no robust LLM security measures are known. Classic jailbreaks encode LLM prompts to bypass model safeguards; they tend to be unstable, add token overhead, and reduce model performance (Chao et al. 2024; Mazeika et al. 2024; Souly et al. 2024).
arXiv.org Artificial Intelligence
Dec-6-2024