To Distill or Decide? Understanding the Algorithmic Trade-off in Partially Observable RL
Neural Information Processing Systems
Partial observability is a notorious challenge in reinforcement learning (RL), due to the need to learn complex, history-dependent policies. Recent empirical successes have used privileged expert distillation -- which leverages the availability of latent state information during training (e.g., from a simulator) to learn and imitate the optimal latent, Markovian policy -- to disentangle the task of "learning to see" from "learning to act". While expert distillation is more computationally efficient than RL without latent state information, it also has well-documented failure modes. In this paper -- through a simple but instructive theoretical model called the, and controlled experiments on challenging simulated locomotion tasks -- we investigate the algorithmic trade-off between privileged expert distillation and standard RL without privileged information.
Jun-13-2026, 01:08:27 GMT