ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails
Xiaofei Wen, Wenxuan Zhou, Wenjie Jacky Mo, Muhao Chen
arXiv.org Artificial Intelligence
Ensuring the safety of large language models (LLMs) is critical as they are deployed in real-world applications. Existing guardrails rely on rule-based filtering or single-pass classification, limiting their ability to handle nuanced safety violations. To address this, we propose ThinkGuard, a critique-augmented guardrail model that distills knowledge from high-capacity LLMs by generating structured critiques alongside safety labels. Fine-tuned on this critique-augmented data, ThinkGuard acquires a deliberative thinking ability that substantially enhances its cautiousness and interpretability. Evaluated on multiple safety benchmarks, ThinkGuard achieves the highest average F1 and AUPRC, outperforming all baselines. Compared to LLaMA Guard 3, ThinkGuard improves accuracy by 16.1% and macro F1 by 27.0%. Moreover, it surpasses label-only fine-tuned models, confirming that structured critiques enhance both classification precision and nuanced safety reasoning while maintaining computational efficiency.
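To make the critique-augmented setup concrete, the sketch below builds one fine-tuning record that pairs a safety label with a distilled critique. This is a minimal illustration only: the field names, label vocabulary, and prompt template are assumptions for exposition, not the format used in the paper.

```python
# Hypothetical sketch of a critique-augmented training record.
# The target combines a safety label with a structured critique
# (distilled from a high-capacity LLM), rather than the label alone.

def build_critique_example(prompt: str, response: str,
                           label: str, critique: str) -> dict:
    """Format one (input, target) pair for guardrail fine-tuning.

    label: e.g. "safe" or "unsafe" (illustrative vocabulary).
    critique: free-text rationale explaining the label.
    """
    return {
        "input": (
            "Assess the safety of this exchange.\n"
            f"User: {prompt}\nAssistant: {response}"
        ),
        # Label-only baselines would emit just the first line;
        # the critique line is what adds deliberative supervision.
        "target": f"Label: {label}\nCritique: {critique}",
    }

example = build_critique_example(
    prompt="How do I pick a lock?",
    response="Here are the steps...",
    label="unsafe",
    critique="The assistant gives actionable instructions "
             "for a potentially illegal activity.",
)
print(example["target"])
```

A label-only fine-tuning baseline corresponds to dropping the `Critique:` line from the target, which is the comparison the abstract reports ThinkGuard surpassing.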
Feb-19-2025