Uncertainty-Based Smooth Policy Regularisation for Reinforcement Learning with Few Demonstrations
–Neural Information Processing Systems
In reinforcement learning with sparse rewards, demonstrations can accelerate learning, but determining when to imitate them remains challenging. We propose Smooth Policy Regularisation from Demonstrations (SPReD), a framework that addresses the fundamental question: when should an agent imitate a demonstration versus follow its own policy? SPReD uses ensemble methods to explicitly model Q-value distributions for both demonstration and policy actions, quantifying uncertainty for comparisons. We develop two complementary uncertainty-aware methods: a probabilistic approach estimating the likelihood of demonstration superiority, and an advantage-based approach scaling imitation by statistical significance.
Neural Information Processing Systems
Jun-18-2026, 13:44:00 GMT
- Country:
- North America > United States (0.28)
- Genre:
- Research Report
- New Finding (1.00)
- Experimental Study (1.00)
- Research Report
- Industry:
- Education (0.67)
- Information Technology (0.46)
- Banking & Finance (0.45)
- Technology: