Don't Walk the Line: Boundary Guidance for Filtered Generation

Oct-15-2025–arXiv.org Artificial Intelligence

Generative models are increasingly paired with safety classifiers that filter harmful or undesirable outputs. A common strategy is to fine-tune the generator to reduce the probability of being filtered, but this can be suboptimal: it often pushes the model toward producing samples near the classifier's decision boundary, increasing both false positives and false negatives. We propose Boundary Guidance, a reinforcement learning fine-tuning method that explicitly steers generation away from the classifier's margin. On a benchmark of jailbreak and ambiguous prompts, Boundary Guidance improves both the safety and the utility of outputs, as judged by LLM-as-a-Judge evaluations. Comprehensive ablations across model scales and reward designs demonstrate the robustness of our approach.

Modern AI deployment increasingly relies on compound safety systems where generative models are paired with downstream safety classifiers that filter harmful or undesirable outputs (NVIDIA Corporation, 2025; Microsoft Corporation, 2025; Sharma et al., 2025). This architecture allows organizations to maintain flexibility in their safety policies while leveraging the complementary strengths of both safety-trained models and specialized classifiers. However, current approaches focus on aligning models independently of their safety classifiers (Bai et al., 2022; Rafailov et al., 2023; Kim et al., 2025), creating a misalignment between training objectives and deployment realities.
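To make the contrast in the abstract concrete, the following is a minimal sketch, not the paper's actual reward design, which this listing does not give. It assumes a classifier that returns a probability p_unsafe in [0, 1] and a filter that triggers at a threshold; the names margin_reward, filter_avoidance_reward, and the default threshold of 0.5 are hypothetical. The first function rewards only a low filter probability, while the second rewards distance from the decision threshold in either direction.

```python
def filter_avoidance_reward(p_unsafe: float) -> float:
    """Baseline idea: reward a lower probability of being filtered."""
    if not 0.0 <= p_unsafe <= 1.0:
        raise ValueError("p_unsafe must lie in [0, 1]")
    return 1.0 - p_unsafe


def margin_reward(p_unsafe: float, threshold: float = 0.5) -> float:
    """Hypothetical margin-based reward (an assumption, not the paper's formula).

    Returns 0.0 when the classifier score sits exactly on the threshold and
    1.0 when the score is at either extreme (0.0 or 1.0), so the generator
    is rewarded for outputs the classifier can label confidently.
    """
    if not 0.0 <= p_unsafe <= 1.0:
        raise ValueError("p_unsafe must lie in [0, 1]")
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    if p_unsafe >= threshold:
        # Distance above the threshold, scaled by the room available above it.
        return (p_unsafe - threshold) / (1.0 - threshold)
    # Distance below the threshold, scaled by the room available below it.
    return (threshold - p_unsafe) / threshold


if __name__ == "__main__":
    # A borderline output: just under the filter threshold.
    borderline = 0.45
    print(filter_avoidance_reward(borderline))  # moderately high reward
    print(margin_reward(borderline))            # low reward: close to the boundary
```

Under the baseline, a borderline output that barely escapes the filter still earns a moderate reward, which is the pressure toward the decision boundary the abstract describes. The margin-based variant gives that same output a small reward. How such a term would be combined with a utility signal in practice is left open here.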

Keywords: arXiv preprint, large language model, machine learning


Genre:
- Research Report > New Finding (0.68)

Industry:
- Information Technology (0.87)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
