Pan, Hsiao-Ru
Skill or Luck? Return Decomposition via Advantage Functions
Pan, Hsiao-Ru, Schölkopf, Bernhard
Learning from off-policy data is essential for sample-efficient reinforcement learning. In the present work, we build on the insight that the advantage function can be understood as the causal effect of an action on the return, and show that this allows us to decompose the return of a trajectory into parts caused by the agent's actions (skill) and parts outside of the agent's control (luck). Furthermore, this decomposition enables us to naturally extend Direct Advantage Estimation (DAE) to off-policy settings (Off-policy DAE). The resulting method can learn from off-policy trajectories without relying on importance sampling techniques or truncating off-policy actions. We draw connections between Off-policy DAE and previous methods to demonstrate how it can speed up learning and when the proposed off-policy corrections are important. Finally, we use the MinAtar environments to illustrate how ignoring off-policy corrections can lead to suboptimal policy optimization performance.
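As a rough illustration of the decomposition described in this abstract (not the paper's exact derivation), the standard identity A^π(s,a) = Q^π(s,a) - V^π(s) lets the discounted return of a trajectory be split into a skill term and a luck term; the symbols V^π, Q^π, A^π, γ and the labels below follow conventional RL notation and are assumptions, not taken from the paper:

\[
G = \sum_{t \ge 0} \gamma^t r_t
  = V^\pi(s_0)
  + \underbrace{\sum_{t \ge 0} \gamma^t A^\pi(s_t, a_t)}_{\text{effect of the agent's actions (skill)}}
  + \underbrace{\sum_{t \ge 0} \gamma^t \bigl( r_t + \gamma V^\pi(s_{t+1}) - Q^\pi(s_t, a_t) \bigr)}_{\text{environment stochasticity (luck)}},
\]

where each term in the second sum has zero mean conditioned on \((s_t, a_t)\), so it captures randomness outside the agent's control.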
Homomorphism Autoencoder -- Learning Group Structured Representations from Observed Transitions
Keurti, Hamza, Pan, Hsiao-Ru, Besserve, Michel, Grewe, Benjamin F., Schölkopf, Bernhard
How agents can learn internal models that veridically represent interactions with the real world is a largely open question. As machine learning is moving towards representations containing not just observational but also interventional knowledge, we study this problem using tools from representation learning and group theory. We propose methods enabling an agent acting upon the world to learn internal representations of sensory information that are consistent with actions that modify it. We use an autoencoder equipped with a group representation acting on its latent space, trained using an equivariance-derived loss in order to enforce a suitable homomorphism property on the group representation. In contrast to existing work, our approach does not require prior knowledge of the group and does not restrict the set of actions the agent can perform.
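A minimal sketch of the kind of training signal this abstract describes, assuming a PyTorch-style encoder/decoder and a learned map rho from group elements (here, action vectors) to matrices acting on the latent space; the names enc, dec, rho and the two loss terms are illustrative assumptions, not the authors' implementation:

import torch
import torch.nn as nn

latent_dim, obs_dim, action_dim = 8, 64, 2

# Encoder, decoder, and a network producing a latent_dim x latent_dim
# matrix representation rho(g) for each group element / action g.
enc = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, obs_dim))
rho = nn.Sequential(nn.Linear(action_dim, 128), nn.ReLU(),
                    nn.Linear(128, latent_dim * latent_dim))

def hae_style_loss(o_t, g_t, o_next):
    z_t = enc(o_t)                                        # encode current observation
    R = rho(g_t).view(-1, latent_dim, latent_dim)         # matrix acting on the latent
    z_pred = torch.bmm(R, z_t.unsqueeze(-1)).squeeze(-1)  # apply the group action in latent space
    # (i) decoding the transformed latent should predict the next observation,
    # (ii) the transformed latent should match the encoding of the next observation.
    recon = ((dec(z_pred) - o_next) ** 2).mean()
    equiv = ((z_pred - enc(o_next)) ** 2).mean()
    return recon + equiv

# Usage on a random batch of observed transitions (o_t, g_t, o_next).
o_t, g_t, o_next = torch.randn(32, obs_dim), torch.randn(32, action_dim), torch.randn(32, obs_dim)
loss = hae_style_loss(o_t, g_t, o_next)
loss.backward()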
Direct Advantage Estimation
Pan, Hsiao-Ru, Gürtler, Nico, Neitz, Alexander, Schölkopf, Bernhard
The predominant approach in reinforcement learning is to assign credit to actions based on the expected return. However, we show that the return may depend on the policy in a way which could lead to excessive variance in value estimation and slow down learning. Instead, we show that the advantage function can be interpreted as a causal effect and shares similar properties with causal representations. Based on this insight, we propose Direct Advantage Estimation (DAE), a novel method that can model the advantage function and estimate it directly from on-policy data while simultaneously minimizing the variance of the return without requiring the (action-)value function. We also relate our method to Temporal Difference methods by showing how value functions can be seamlessly integrated into DAE. The proposed method is easy to implement and can be readily adopted by modern actor-critic methods. We evaluate DAE empirically on three discrete control domains and show that it can outperform generalized advantage estimation (GAE), a strong baseline for advantage estimation, on a majority of the environments when applied to policy optimization.
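A sketch of the kind of constrained regression this abstract describes, based only on its wording and standard notation (the exact objective in the paper may differ): fit \(\hat A\) and \(\hat V\) by regressing the return on the sum of advantages along a trajectory, while constraining \(\hat A\) to be centered under the policy,

\[
\min_{\hat A, \hat V} \; \mathbb{E}_\pi\!\left[\left( \sum_{t \ge 0} \gamma^t \bigl( r_t - \hat A(s_t, a_t) \bigr) - \hat V(s_0) \right)^2 \right]
\quad \text{subject to} \quad \sum_a \pi(a \mid s)\, \hat A(s, a) = 0 \;\; \forall s.
\]

The true advantage of \(\pi\) satisfies the zero-mean constraint by definition, and the squared objective simultaneously reduces the variance of the advantage-corrected return, matching the abstract's claim that the advantage is estimated directly from on-policy data without an (action-)value function.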