Learning One Representation to Optimize All Rewards
Neural Information Processing Systems
We introduce the forward-backward (FB) representation of the dynamics of a reward-free Markov decision process. It provides explicit near-optimal policies for any reward specified a posteriori. During an unsupervised phase, we use reward-free interactions with the environment to learn two representations via off-the-shelf deep learning methods and temporal difference (TD) learning. In the test phase, a reward representation is estimated either from reward observations or from an explicit reward description (e.g., a target state). The optimal policy for that reward is then obtained directly from these representations, with no planning.
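The test phase described in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so the sketch assumes the following: a reward representation z is the average of r(s) * B(s) over sampled states, a target state g gives z = B(g), and the policy picks the action maximizing the inner product of the forward embedding with z. The function names, array shapes, and toy numbers are hypothetical, and the learned networks are replaced by fixed arrays.

```python
import numpy as np

def infer_z_from_rewards(B_states, rewards):
    """Assumed estimator: z is the mean of r(s) * B(s) over sampled states.

    B_states: (n, d) backward embeddings of n sampled states (stand-in values).
    rewards:  (n,) rewards observed at those states.
    """
    return (rewards[:, None] * B_states).mean(axis=0)

def infer_z_from_goal(B_goal):
    """Assumed rule for a target-state description: z is the goal's embedding."""
    return np.asarray(B_goal, dtype=float)

def greedy_action(F_sa, z):
    """Pick the action whose forward embedding has the largest inner product with z.

    F_sa: (num_actions, d) forward embeddings at one state (stand-in values;
    in a learned model these would typically also depend on z).
    """
    return int(np.argmax(F_sa @ z))

# Toy usage with hand-written embeddings (illustrative only).
B_states = np.array([[1.0, 0.0], [0.0, 1.0]])
rewards = np.array([2.0, 0.0])
z = infer_z_from_rewards(B_states, rewards)

F_sa = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])
action = greedy_action(F_sa, z)
```

Under these assumptions no planning loop appears at test time: once z is estimated, acting reduces to one matrix-vector product and an argmax per state.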
Dec-23-2025, 16:39:23 GMT