Learning One Representation to Optimize All Rewards

Dec-23-2025, 16:39:23 GMT–Neural Information Processing Systems

We introduce the forward-backward (FB) representation of the dynamics of a reward-free Markov decision process. It provides explicit near-optimal policies for any reward specified a posteriori. During an unsupervised phase, we use reward-free interactions with the environment to learn two representations via off-the-shelf deep learning methods and temporal difference (TD) learning. In the test phase, a reward representation is estimated either from reward observations or an explicit reward description (e.g., a target state). The optimal policy for that reward is directly obtained from these representations, with no planning.
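The test phase described above can be illustrated with a toy sketch. It assumes the formulation we understand the full paper to use, which the abstract itself does not spell out: a reward embedding z estimated as the average of r(s) B(s) over observed states (or z = B(goal) for a target state), and a policy that picks the action maximizing F(s, a, z) · z. The tables `B_TABLE` and `F_TABLE`, the helper names, and the way `forward` depends on z are hypothetical stand-ins for the two learned networks, filled with random numbers rather than trained.

```python
import random

D = 4          # embedding dimension (assumed for illustration)
N_STATES = 5   # toy discrete state space (assumed)
N_ACTIONS = 3  # toy discrete action space (assumed)

rng = random.Random(0)

# Hypothetical stand-ins for the learned representations.
# In practice these would be neural networks trained in the unsupervised phase.
B_TABLE = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(N_STATES)]
F_TABLE = [[[rng.gauss(0, 1) for _ in range(D)] for _ in range(N_ACTIONS)]
           for _ in range(N_STATES)]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def backward(s):
    """Backward embedding B(s) of a state."""
    return B_TABLE[s]


def forward(s, a, z):
    """Forward embedding F(s, a, z); the dependence on z here is made up."""
    return [f + 0.1 * zi for f, zi in zip(F_TABLE[s][a], z)]


def reward_embedding(samples):
    """Estimate z as the mean of r * B(s) over (state, reward) samples."""
    z = [0.0] * D
    for s, r in samples:
        for i in range(D):
            z[i] += r * backward(s)[i]
    return [zi / len(samples) for zi in z]


def goal_embedding(goal_state):
    """Reward embedding for 'reach this target state': z = B(goal)."""
    return list(backward(goal_state))


def act(s, z):
    """Greedy action: argmax over a of F(s, a, z) . z, with no planning."""
    return max(range(N_ACTIONS), key=lambda a: dot(forward(s, a, z), z))


# Reward observed on a handful of states, then acting greedily.
z_reward = reward_embedding([(0, 1.0), (1, 0.0), (3, 0.5)])
z_goal = goal_embedding(2)
actions_for_reward = [act(s, z_reward) for s in range(N_STATES)]
actions_for_goal = [act(s, z_goal) for s in range(N_STATES)]
```

The point of the sketch is the shape of the computation: once z is known, choosing an action is a single argmax over inner products, so switching to a new reward only requires recomputing z.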



Technology:
- Information Technology > Artificial Intelligence > Machine Learning
  - Reinforcement Learning (0.76)
  - Neural Networks > Deep Learning (0.59)