Reward Shaping via Diffusion Process in Reinforcement Learning
–arXiv.org Artificial Intelligence
In this article, I take inspiration from stochastic thermodynamics to derive a problem formulation for online learning in uncertain Markov decision processes (MDPs) that is grounded in system dynamics. The system balances the diffusion process against the drift dynamics as a way to formulate the exploration-exploitation trade-off. To this end, I make an explicit link between information entropy and the stochastic dynamics of a system coupled to an environment. I analyze three sources of entropy production: the decision-maker's uncertainty about the characteristics of the system-environment interaction; the stochastic nature of the system dynamics; and the interaction between the decision-maker's knowledge and the system dynamics. This analysis yields a framework that can be formulated either as a maximum entropy program, to derive efficient policies that balance the exploration-exploitation trade-off, or as a modified cost optimization program that includes informational costs and benefits.
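The abstract does not give the paper's equations, so the following Python sketch is only a generic illustration of two textbook facts that its framing echoes; it is not the author's formulation. First, for discrete actions with costs c_i, the policy that minimizes an information-augmented cost E_p[c] - T·H(p) (equivalently, maximizes entropy at a fixed expected cost) is the Boltzmann policy p_i ∝ exp(-c_i/T), and its free energy equals -T·log Z. Second, an overdamped Langevin diffusion dX = -U'(X) dt + sqrt(2T) dW, whose drift descends the cost (exploitation) while its noise spreads the state (exploration), has the stationary density ∝ exp(-U/T). The function names, the temperature T, and the quadratic cost U(x) = x²/2 are assumptions made for illustration.

```python
import math
import random


def boltzmann_policy(costs, temperature):
    """Hypothetical helper: maximum-entropy policy p_i ∝ exp(-c_i / T)."""
    m = min(costs)  # shift for numerical stability; cancels after normalization
    weights = [math.exp(-(c - m) / temperature) for c in costs]
    z = sum(weights)
    return [w / z for w in weights]


def free_energy(policy, costs, temperature):
    """Hypothetical helper: information-augmented cost E_p[c] - T * H(p), H in nats."""
    expected_cost = sum(p * c for p, c in zip(policy, costs))
    entropy = -sum(p * math.log(p) for p in policy if p > 0)
    return expected_cost - temperature * entropy


def langevin_samples(temperature, n_particles=1000, n_steps=800, dt=0.01,
                     x0=3.0, seed=0):
    """Euler-Maruyama for dX = -U'(X) dt + sqrt(2T) dW with assumed U(x) = x^2 / 2.

    The drift -x pulls particles toward the cost minimum (exploitation);
    the diffusion term, scaled by the temperature, spreads them (exploration).
    """
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * temperature * dt)
    xs = [x0] * n_particles
    for _ in range(n_steps):
        xs = [x - x * dt + noise_scale * rng.gauss(0.0, 1.0) for x in xs]
    return xs


if __name__ == "__main__":
    costs = [1.0, 1.5, 3.0]  # illustrative action costs
    T = 0.5                  # illustrative temperature (exploration weight)

    policy = boltzmann_policy(costs, T)
    greedy = [1.0, 0.0, 0.0]             # pure exploitation
    uniform = [1.0 / 3, 1.0 / 3, 1.0 / 3]  # pure exploration
    for name, p in [("boltzmann", policy), ("greedy", greedy), ("uniform", uniform)]:
        print(name, round(free_energy(p, costs, T), 4))

    xs = langevin_samples(T)
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    # Stationary density ∝ exp(-x^2 / (2T)), so the variance should be close to T.
    print("empirical variance", round(var, 3))
```

The Boltzmann policy scores a lower free energy than both the greedy (zero-entropy) and the uniform (maximum-entropy) policies, which is one concrete sense in which the temperature T trades exploitation against exploration.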
Jun-20-2023
- Country:
  - Europe > United Kingdom
    - England > Cambridgeshire > Cambridge (0.14)
  - North America > United States (0.93)
- Genre:
  - Research Report (0.50)
- Industry:
  - Energy > Oil & Gas
    - Upstream (0.34)
  - Health & Medicine (0.93)