Reducing the Probability of Undesirable Outputs in Language Models Using Probabilistic Inference
Neural Information Processing Systems
Reinforcement learning (RL) has become a predominant technique for aligning language models (LMs) with human preferences or for promoting outputs that a given reward function deems desirable. Standard RL approaches optimize average reward, while methods explicitly focused on reducing the probability of undesired outputs typically come at a cost to average-case performance. To improve this tradeoff, we introduce RePULSe, a new training method that augments the standard RL loss with an additional loss that uses learned proposals to guide the sampling of low-reward outputs and then reduces the probability of those outputs. We run experiments demonstrating that RePULSe achieves a better tradeoff between expected reward and the probability of undesired outputs, and is more adversarially robust, than standard RL alignment approaches and alternatives.
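As a rough illustration of the idea of adding a proposal-guided penalty to an average-reward objective, the toy sketch below works on a small categorical distribution over outputs rather than a language model. It is not the paper's loss or implementation: the function name `combined_loss`, the weight `alpha`, the use of exact expectations in place of sampling, and the specific form of the penalty are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combined_loss(policy_logits, proposal_logits, rewards, alpha=1.0):
    """Hypothetical toy objective (an assumption, not the paper's loss).

    rl_loss : negative expected reward under the policy.
    penalty : expected policy log-probability under a proposal that is
              assumed to concentrate on low-reward outputs; minimizing
              it lowers the policy's probability of those outputs.
    """
    p = softmax(policy_logits)    # policy over a small set of outputs
    q = softmax(proposal_logits)  # proposal aimed at low-reward outputs
    rl_loss = -sum(pi * r for pi, r in zip(p, rewards))
    penalty = sum(qi * math.log(pi) for qi, pi in zip(q, p))
    return rl_loss + alpha * penalty


# Three candidate outputs; the last one is strongly undesired.
rewards = [1.0, 0.5, -5.0]
proposal = [0.0, 0.0, 3.0]  # proposal mass mostly on the bad output

uniform_policy = [0.0, 0.0, 0.0]
safer_policy = [0.0, 0.0, -2.0]  # less probability on the bad output

loss_uniform = combined_loss(uniform_policy, proposal, rewards)
loss_safer = combined_loss(safer_policy, proposal, rewards)
```

In this toy setting, the policy that places less mass on the undesired output receives a lower combined loss, and setting `alpha=0.0` recovers the plain negative expected reward.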
Jun-22-2026, 17:16:43 GMT
- Country:
- North America > Canada > Ontario (0.28)
- Genre:
- Research Report
- Experimental Study (1.00)
- New Finding (0.66)
- Industry:
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Government (1.00)
- Banking & Finance (0.93)
- Law > Criminal Law (0.67)
- Health & Medicine
- Pharmaceuticals & Biotechnology (0.93)
- Therapeutic Area > Psychiatry/Psychology (0.67)
- Technology: