Jailbreak Detection in Clinical Training LLMs Using Feature-Based Predictive Models

Nguyen, Tri; Pentapalli, Lohith Srikanth; Sieverding, Magnus; Turner, Laurah; Overla, Seth; Zheng, Weibing; Zhou, Chris; Furniss, David; Weber, Danielle; Gharib, Michael; Kelleher, Matt; Shukis, Michael; Pawlik, Cameron; Cohen, Kelly

May-2-2025–arXiv.org Artificial Intelligence

Jailbreaking in Large Language Models (LLMs) threatens their safe use in sensitive domains like education by allowing users to bypass ethical safeguards. This study focuses on detecting jailbreaks in 2-Sigma, a clinical education platform that simulates patient interactions using LLMs. We annotated over 2,300 prompts across 158 conversations using four linguistic variables shown to correlate strongly with jailbreak behavior. The extracted features were used to train several predictive models, including Decision Trees, Fuzzy Logic-based classifiers, Boosting methods, and Logistic Regression. Results show that feature-based predictive models consistently outperformed Prompt Engineering, with the Fuzzy Decision Tree achieving the best overall performance. Our findings demonstrate that linguistic-feature-based models are effective and explainable alternatives for jailbreak detection. We suggest future work explore hybrid frameworks that integrate prompt-based flexibility with rule-based robustness for real-time, spectrum-based jailbreak monitoring in educational LLMs.
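As a rough illustration of the feature-based approach described in the abstract, the sketch below fits a Logistic Regression classifier (one of the model families listed) to four numeric feature scores per prompt. Everything concrete here is an assumption made for illustration: the abstract does not name the four linguistic variables, so the columns are unnamed placeholders, the data is made up, and the function names (`train_logistic_regression`, `jailbreak_probability`) are hypothetical. It is not the authors' implementation.

```python
import math

# Toy, made-up data (an assumption, not from the paper): each row holds four
# annotated feature scores in [0, 1], standing in for the four linguistic
# variables, which are not named here. Label 1 = jailbreak, 0 = benign.
TOY_FEATURES = [
    [0.9, 0.8, 0.7, 0.9],
    [0.8, 0.9, 0.6, 0.7],
    [0.7, 0.6, 0.9, 0.8],
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.1, 0.3, 0.1],
    [0.0, 0.3, 0.2, 0.2],
]
TOY_LABELS = [1, 1, 1, 0, 0, 0]


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def jailbreak_probability(row, weights, bias):
    """Predicted probability that a prompt with these feature scores is a jailbreak."""
    return sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)


def train_logistic_regression(features, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_samples = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for row, label in zip(features, labels):
            error = jailbreak_probability(row, weights, bias) - label
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


weights, bias = train_logistic_regression(TOY_FEATURES, TOY_LABELS)
predictions = [
    int(jailbreak_probability(row, weights, bias) >= 0.5) for row in TOY_FEATURES
]
```

Because the fitted model is a weighted sum of the feature scores, each weight can be read directly as the direction and strength of one feature's contribution, which is the kind of explainability the abstract attributes to feature-based models.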

Keywords: arXiv preprint, large language model, machine learning


Country:
- North America > United States > Michigan (0.28)

Genre:
- Research Report > New Finding (1.00)

Industry:
- Health & Medicine (1.00)
- Education (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Decision Tree Learning (1.00)
  - Representation & Reasoning > Uncertainty
    - Fuzzy Logic (0.90)
