ProSocialAlign: Preference Conditioned Test Time Alignment in Language Models
Somnath Banerjee, Sayan Layek, Sayantan Adak, Mykola Pechenizkiy, Animesh Mukherjee, Rima Hazra
arXiv.org Artificial Intelligence
Current language model safety paradigms often fall short in emotionally charged or high-stakes settings, where refusal-only approaches may alienate users and naive compliance can amplify risk. We propose ProSocialAlign, a test-time, parameter-efficient framework that steers generation toward safe, empathetic, and value-aligned responses without retraining the base model. We formalize five human-centered objectives and cast safety as lexicographic constrained generation: first, applying hard constraints to eliminate harmful continuations; then optimizing for prosocial quality within the safe set. Our method combines (i) directional regulation, a harm-mitigation mechanism that subtracts a learned "harm vector" in parameter space, and (ii) preference-aware autoregressive reward modeling trained jointly across attributes with gradient conflict resolution, enabling fine-grained, user-controllable decoding. Empirical evaluations across five safety benchmarks demonstrate state-of-the-art performance, reducing unsafe leakage and boosting alignment to human values, with strong gains across multiple evaluation metrics. ProSocialAlign offers a robust and modular foundation for generating context-sensitive, safe, and human-aligned responses at inference time.
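The "harm vector" idea described above resembles task-vector arithmetic: take the parameter-space difference between a model fine-tuned on harmful data and the original base model, then subtract a scaled copy of that difference from the base weights to steer generation away from harm. A minimal NumPy sketch under that assumption follows; the function names (`harm_vector`, `directional_regulation`) and the scaling factor `alpha` are illustrative placeholders, not the paper's actual API.

```python
import numpy as np

def harm_vector(base_params, harmful_params):
    # Harm direction: per-parameter difference between a model
    # fine-tuned on harmful data and the original base model.
    return {k: harmful_params[k] - base_params[k] for k in base_params}

def directional_regulation(base_params, hv, alpha=0.5):
    # Subtract a scaled copy of the harm vector from the base weights,
    # moving the edited model away from the harmful direction.
    return {k: base_params[k] - alpha * hv[k] for k in base_params}

# Toy two-weight example.
base = {"w": np.array([1.0, 2.0])}
harmful = {"w": np.array([1.5, 1.0])}
hv = harm_vector(base, harmful)
edited = directional_regulation(base, hv, alpha=1.0)
# With alpha=1.0 the edit exactly undoes the harmful shift:
# edited["w"] == [0.5, 3.0]
```

In practice `alpha` would trade off harm mitigation against capability retention, and the same dictionary arithmetic would be applied over a full model state dict rather than a single weight.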
Dec-9-2025