Reducing the Probability of Undesirable Outputs in Language Models Using Probabilistic Inference
Stephen Zhao, Aidan Li, Rob Brekelmans, Roger Grosse
Reinforcement learning (RL) has become a predominant technique for aligning language models (LMs) with human preferences or promoting outputs that a given reward function deems desirable. Standard RL approaches optimize average reward, while methods explicitly focused on reducing the probability of undesired outputs typically come at a cost to average-case performance. To improve this tradeoff, we introduce RePULSe, a new training method that augments the standard RL loss with an additional loss that uses learned proposals to guide sampling of low-reward outputs and then reduces the probability of those outputs. Our experiments show that RePULSe produces a better tradeoff between expected reward and the probability of undesired outputs, and is more adversarially robust, than standard RL alignment approaches and alternatives.
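The idea of augmenting an RL loss with a term that targets low-reward outputs can be illustrated with a toy sketch. This is not the RePULSe loss itself, which the abstract does not spell out. It assumes a policy over a handful of discrete outputs, a fixed (rather than learned) proposal that concentrates on low-reward outputs, and exact expectations in place of sampling. The names `combined_loss`, `proposal_logits`, and `alpha` are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def combined_loss(policy_logits, proposal_logits, rewards, alpha=0.5):
    """Toy loss: negative expected reward plus a penalty on the
    log-probability the policy assigns to proposal-favored outputs.

    Assumption: this form is illustrative only, not the paper's loss.
    """
    p = softmax(policy_logits)    # policy distribution over outputs
    q = softmax(proposal_logits)  # proposal, assumed to favor low-reward outputs
    expected_reward = sum(pi * r for pi, r in zip(p, rewards))
    # Expected log-probability under the policy of outputs drawn from q.
    # Minimizing it moves policy mass away from those outputs.
    proposal_logprob = sum(qi * math.log(pi) for qi, pi in zip(q, p))
    return -expected_reward + alpha * proposal_logprob

rewards = [1.0, 0.0, -5.0]           # the third output is the undesired one
proposal_logits = [-1.0, 0.0, 5.0]   # proposal concentrates on the third output

uniform_policy = [0.0, 0.0, 0.0]
safer_policy = [0.0, 0.0, -3.0]      # less mass on the undesired output

loss_uniform = combined_loss(uniform_policy, proposal_logits, rewards)
loss_safer = combined_loss(safer_policy, proposal_logits, rewards)
print(loss_safer < loss_uniform)
```

In this toy setting the second term rewards any reduction in the probability of the undesired output, so the policy with less mass on that output gets the lower loss. The penalty is unbounded below as that probability approaches zero, so a practical variant would need some form of regularization, which this sketch leaves out.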
Oct-27-2025
- Country:
- Europe > Portugal (0.04)
- South America > Chile
- North America > Canada
- Asia
- Middle East > Jordan (0.04)
- Thailand > Bangkok
- Bangkok (0.04)
- Genre:
- Research Report > Experimental Study (1.00)
- Industry:
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Government (1.00)
- Banking & Finance (1.00)
- Law > Criminal Law (0.67)
- Health & Medicine
- Pharmaceuticals & Biotechnology (0.93)
- Therapeutic Area > Psychiatry/Psychology (0.46)
- Technology: