[WIP] Jailbreak Paradox: The Achilles' Heel of LLMs
Abhinav Rao, Monojit Choudhury, Somak Aditya
arXiv.org Artificial Intelligence
We introduce two paradoxes concerning the jailbreaking of foundation models: first, that it is impossible to construct a perfect jailbreak classifier; and second, that a weaker model cannot consistently detect whether a stronger (in a Pareto-dominant sense) model is jailbroken. We provide formal proofs for these paradoxes and a short case study on Llama and GPT-4o to demonstrate them. We discuss the broader theoretical and practical repercussions of these results.
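The first impossibility result is reminiscent of classic diagonalization arguments. The following is a minimal illustrative sketch of that style of argument, not the authors' actual proof: it assumes only that a candidate "perfect" classifier can itself be referenced from within a prompt, so that the prompt's ground truth is defined as the negation of the classifier's own verdict.

```python
# Illustrative diagonalization sketch (NOT the paper's proof): against any
# candidate jailbreak classifier, construct a self-referential prompt whose
# "jailbreak-ness" is defined as the opposite of the classifier's verdict.

def make_adversarial_prompt(classifier):
    class SelfReferentialPrompt:
        # Hypothetical construction: the prompt embeds the classifier itself.
        def is_jailbreak(self):
            # Ground truth is defined as the negation of the verdict.
            return not classifier(self)
    return SelfReferentialPrompt()

def classifier_is_correct_on_diagonal(classifier):
    p = make_adversarial_prompt(classifier)
    verdict = classifier(p)       # what the classifier claims
    truth = p.is_jailbreak()      # ground truth, by construction the opposite
    return verdict == truth       # always False on this prompt

# Any classifier, however it decides, fails on its own diagonal prompt:
assert classifier_is_correct_on_diagonal(lambda prompt: True) is False
assert classifier_is_correct_on_diagonal(lambda prompt: False) is False
```

The sketch shows only the shape of the argument: whatever the classifier answers on the constructed prompt, the answer is wrong, so no classifier is correct on all inputs.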
Jun-20-2024
- Country:
- Asia > Indonesia
- Bali (0.04)
- Europe > France
- Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
- North America > United States
- Massachusetts > Middlesex County
- Cambridge (0.04)
- Pennsylvania > Allegheny County
- Pittsburgh (0.04)
- Genre:
- Research Report (0.40)