Low-Dimensional State and Action Representation Learning with MDP Homomorphism Metrics

Botteghi, Nicolò, Poel, Mannes, Sirmacek, Beril, Brune, Christoph

arXiv.org Artificial Intelligence 

In the last decade, Deep Reinforcement Learning [1] algorithms have solved increasingly complicated problems in many different domains, ranging from video games [2] to numerous robotics applications [3], in an end-to-end fashion. Despite the success of end-to-end Reinforcement Learning, these methods suffer from low sample efficiency and usually require lengthy and expensive training procedures to learn optimal behaviours. This problem is even more pronounced when the true state of the environment is not observable and the observation space O or the action space A is high-dimensional. In end-to-end settings, because the reward signal provides only weak supervision, Reinforcement Learning algorithms are not forced to learn good state representations of the environment, which makes the mapping from observations to actions challenging to learn and interpret. State representation learning [4] methods aim at reducing the dimensionality of the observation stream by learning a mapping from the observation space O to a lower-dimensional state space S containing only the meaningful features needed for solving a given task. By employing self-supervised auxiliary losses, it is possible to enforce optimal state representations and learn models of the underlying Markov Decision Process (MDP). When policies are learned using the abstract or latent state-space variables, the training time is often reduced, and the sample efficiency, robustness, and generalisation capabilities of the policies improve compared to end-to-end Reinforcement Learning [5], [6], [7]. While the problem of state representation and observation compression has been extensively studied [4], only a few works have extended the concept of dimensionality reduction to the action space A. In this category, we find the works in [8], [9] and [10], where low-dimensional action representations are used to improve training efficiency