A Proofs

Throughout this section, we use p(s_t = s, a_t = a) to denote the probability that the state-action pair at time step t equals (s, a), and we denote the probability of a trajectory τ by p(τ) = p(s, a

Neural Information Processing Systems 

Let's first consider the minimum for V̂,

Next, we prove the second part of the theorem regarding f.

Note that, unlike the original PPO, which samples mini-batches of frames, we sample on a trajectory-by-trajectory basis. For example, if the batch size is 256 and the backup horizon is n = 128, then each batch contains two 128-step trajectories.

C.1 Computational resources

All experiments were performed on an internal cluster of NVIDIA A100 GPUs. Training a MinAtar agent in a single environment takes less than 30 minutes (wall-clock time).
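The trajectory-by-trajectory mini-batch sampling noted earlier (batch size 256, n = 128, hence two 128-step trajectories per batch) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `trajectory_minibatches`, its parameters, and the assumed rollout layout (fixed-length rollouts collected from several parallel environments and cut into non-overlapping n-step segments) are all hypothetical and chosen only for illustration.

```python
import random


def trajectory_minibatches(rollout_len, num_envs, batch_size, n, seed=0):
    """Yield mini-batches made of whole n-step trajectory segments.

    Hypothetical helper: each mini-batch is a list of batch_size // n
    segments, and each segment is a list of n consecutive
    (env_index, time_index) pairs from the same environment.
    """
    if batch_size % n != 0 or rollout_len % n != 0:
        raise ValueError("batch_size and rollout_len must be multiples of n")

    segments_per_batch = batch_size // n

    # Enumerate every non-overlapping n-step segment as (env, start_time).
    segments = [
        (env, start)
        for env in range(num_envs)
        for start in range(0, rollout_len, n)
    ]

    # Shuffle segments, not individual frames, so that the frames of a
    # segment stay contiguous inside a mini-batch.
    random.Random(seed).shuffle(segments)

    # Any leftover segments that do not fill a whole batch are dropped.
    for i in range(0, len(segments) - segments_per_batch + 1, segments_per_batch):
        yield [
            [(env, t) for t in range(start, start + n)]
            for env, start in segments[i:i + segments_per_batch]
        ]


# Example from the text: batch size 256 and n = 128 give 2 segments per batch.
# The choice of 8 parallel environments is an assumption for this example.
batches = list(trajectory_minibatches(rollout_len=128, num_envs=8,
                                      batch_size=256, n=128))
```

Shuffling at the segment level keeps the n consecutive steps needed for an n-step backup together, whereas frame-level shuffling would scatter them across mini-batches.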