Gamma-Models: Generative Temporal Difference Learning for Infinite-Horizon Prediction
We introduce the gamma-model, a predictive model of environment dynamics with an infinite, probabilistic horizon. Replacing standard single-step models with gamma-models leads to generalizations of the procedures that form the foundation of model-based control, including the model rollout and model-based value estimation. The gamma-model, trained with a generative reinterpretation of temporal difference learning, is a natural continuous analogue of the successor representation and a hybrid between model-free and model-based mechanisms. Like a value function, it contains information about the long-term future; like a standard predictive model, it is independent of task reward. We instantiate the gamma-model as both a generative adversarial network and normalizing flow, discuss how its training reflects an inescapable tradeoff between training-time and testing-time compounding errors, and empirically investigate its utility for prediction and control.
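To make the bootstrapped training target and the generalized value estimate concrete, a tabular sketch may help. It assumes a small finite Markov chain under a fixed policy, so the gamma-model reduces to a matrix of discounted occupancies that can be checked against a closed form. The transition matrix `P`, the reward vector `r`, and all variable names are hypothetical choices for illustration, not taken from the paper, and the sketch does not cover the GAN or normalizing-flow instantiations.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma = 5, 0.9

# Hypothetical single-step dynamics under a fixed policy: row-stochastic P[s, s'].
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)

# Closed form of the discounted occupancy over future states (starting at t+1):
#   M = (1 - gamma) * sum_{k>=1} gamma^(k-1) P^k = (1 - gamma) * P (I - gamma P)^(-1)
M_exact = (1 - gamma) * P @ np.linalg.inv(np.eye(n_states) - gamma * P)

# TD-style fixed-point iteration: the target mixes the single-step distribution
# (weight 1 - gamma) with the current estimate bootstrapped from the next state
# (weight gamma).
M = np.full((n_states, n_states), 1.0 / n_states)
for _ in range(500):
    M = (1 - gamma) * P + gamma * P @ M

# Value estimation without a sequential rollout: an expectation of the reward
# under the occupancy, rescaled by 1 / (1 - gamma).
r = rng.random(n_states)  # hypothetical reward on the visited state
V_from_M = M @ r / (1 - gamma)

# Reference value from the Bellman equation, V = P r + gamma P V.
V_bellman = np.linalg.solve(np.eye(n_states) - gamma * P, P @ r)

print(np.abs(M - M_exact).max(), np.abs(V_from_M - V_bellman).max())
```

Note that `M` never references `r`: the same occupancy can be reused with any reward vector, which is the sense in which the model is independent of task reward while still summarizing the long-term future.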
Review for NeurIPS paper: Gamma-Models: Generative Temporal Difference Learning for Infinite-Horizon Prediction
Weaknesses: - One weakness of the successor representation is that it is policy-dependent, so in the control setting it would need to be relearned whenever the policy is modified. One-step models, by contrast, may not suffer from this problem, since they are also conditioned on actions. Could you comment on this issue? It would also seem that when the model outputs a prediction, the agent does not know how far into the future that state lies: it could be the very next state or one far in the future.
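The horizon question can be illustrated with a small simulation. The sketch below assumes a hypothetical tabular chain `P` under a fixed policy and draws from the discounted occupancy the slow way: sample a time offset from a geometric distribution with parameter 1 - gamma, then roll a single-step model forward that many steps. A model trained to output such samples directly would return only the state, with the offset marginalized out. All names here are illustrative and not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma = 4, 0.9

# Hypothetical single-step dynamics under a fixed policy.
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)

def sample_future_state(s, rng):
    """Draw one state from the discounted occupancy by explicit rollout."""
    # NumPy's geometric distribution has support {1, 2, ...}, so the
    # earliest possible sample is the very next state.
    horizon = int(rng.geometric(1 - gamma))
    for _ in range(horizon):
        s = rng.choice(n_states, p=P[s])
    return s, horizon

samples = [sample_future_state(0, rng) for _ in range(5000)]
states = np.array([s for s, _ in samples])
horizons = np.array([h for _, h in samples])

# Empirical distribution over sampled states versus the closed-form occupancy.
empirical = np.bincount(states, minlength=n_states) / len(states)
exact = ((1 - gamma) * P @ np.linalg.inv(np.eye(n_states) - gamma * P))[0]

print("mean offset:", horizons.mean(), "expected:", 1 / (1 - gamma))
print("max occupancy error:", np.abs(empirical - exact).max())
```

In this sketch the offsets range from 1 to far beyond the mean of 1 / (1 - gamma), yet the returned states alone are distributed according to the occupancy, which matches the concern raised above: a single sample carries no label saying how many steps ahead it is.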
Review for NeurIPS paper: Gamma-Models: Generative Temporal Difference Learning for Infinite-Horizon Prediction
Summary: This paper proposes a new model-based RL algorithm in which, instead of learning state transition probabilities, the occupancy distribution over an infinite horizon is learned. The method can be seen as an extension of the successor representation to continuous state-action spaces and to infinite horizons. The occupancy distribution is modeled as an energy function and learned with temporal differences (TD), using a GAN. The experiments on a few MuJoCo problems clearly show the advantages of the proposed approach compared to RL algorithms such as PPO and SAC. The reviewers agree that the proposed method is new, interesting, and validated by the simulation experiments.