Review for NeurIPS paper: Reward Propagation Using Graph Convolutional Networks

The method adopts a sampled-trajectory-based approximation of the transition graph. But given trajectory samples, sequential models (e.g., RNNs) should suffice to estimate the potential functions. It would be helpful if the authors clarified the advantage and necessity of a GCN on sampled trajectory inputs compared to sequential models. The baselines, ICM and RND, are motivated by hard-exploration RL tasks, whereas potential-based reward shaping is motivated by faster convergence; they are related but address different issues. A more informative empirical comparison would be against LIRPG (Learning Intrinsic Rewards for Policy Gradient, from [a]), because both this paper and LIRPG aim to learn reward shaping that speeds up policy learning.
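To make the contrast concrete: potential-based shaping does not change which policies are optimal, it only redistributes reward density, which is why its natural benefit is faster convergence rather than better exploration. A minimal sketch of the standard shaping term F(s, s') = gamma * Phi(s') - Phi(s); the potential function `phi` here is an arbitrary toy choice for illustration, not the paper's learned GCN potential:

```python
def shaped_reward(r, phi_s, phi_s_next, gamma=0.99):
    """Potential-based reward shaping (Ng et al., 1999):
    adds F(s, s') = gamma * Phi(s') - Phi(s) to the environment
    reward, which provably preserves the optimal policy."""
    return r + gamma * phi_s_next - phi_s

# Toy potential over a 1-D chain of states (illustrative only).
def phi(s):
    return float(s)

# Along a trajectory, the shaping terms telescope, so the total
# added reward depends only on the endpoint potentials.
states = [0, 1, 2, 3]
rewards = [0.0, 0.0, 1.0]
shaped = [shaped_reward(r, phi(s), phi(sn))
          for r, s, sn in zip(rewards, states[:-1], states[1:])]
```

Under this framing, LIRPG is the closer baseline because it also learns a shaping-like intrinsic reward to accelerate policy-gradient learning, whereas ICM/RND bonuses are designed to induce exploratory behavior.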