TOP-ERL: Transformer-based Off-Policy Episodic Reinforcement Learning
Li, Ge, Tian, Dong, Zhou, Hongyi, Jiang, Xinkai, Lioutikov, Rudolf, Neumann, Gerhard
arXiv.org Artificial Intelligence
This work introduces a novel off-policy reinforcement learning (RL) algorithm that uses a transformer-based architecture to predict state-action values for a sequence of actions. These value estimates are used to update a policy that predicts a smooth trajectory, rather than a single action, at each decision step. Predicting a whole trajectory of actions is common in episodic RL (ERL) (Kober & Peters, 2008) and differs conceptually from conventional step-based RL (SRL) methods such as PPO (Schulman et al., 2017) and SAC (Haarnoja et al., 2018a), where an action is sampled at every time step. This action-selection concept has proven promising in recent RL work (Otto et al., 2022; Li et al., 2024). Similar insights have emerged in imitation learning, where predicting action sequences instead of single actions has led to great success (Zhao et al., 2023; Reuss et al., 2024). Decision-making in ERL also aligns with human decision-making: people generally do not deliberate at every single time step but instead perform a whole sequence of actions to complete a task, for instance swinging an arm to play tennis without overthinking each per-step movement.

Episodic RL is a distinct family of RL that maximizes returns over entire episodes, typically lasting several seconds, rather than optimizing intermediate states during environment interactions (Whitley et al., 1993; Igel, 2003; Peters & Schaal, 2008). Unlike SRL, ERL shifts the solution search from per-step actions to a parameterized trajectory space, leveraging techniques such as Movement Primitives (MPs) (Schaal, 2006; Paraschos et al., 2013) to generate action sequences. This approach enables a broader exploration horizon (Kober & Peters, 2008), captures temporal and degree-of-freedom (DoF) correlations (Li et al., 2024), and ensures smooth transitions between re-planning phases (Otto et al., 2023).
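The "parameterized trajectory space" idea can be illustrated with a minimal sketch of a basis-function trajectory generator in the spirit of Movement Primitives: a small weight vector is mapped through normalized radial basis functions to a full, smooth action sequence. This is not the paper's implementation; the basis count, bandwidth heuristic, and function name are illustrative assumptions.

```python
import numpy as np

def basis_trajectory(weights, n_steps=100):
    """Map a weight vector to a smooth 1-DoF trajectory via normalized
    radial basis functions over a normalized phase variable.

    Illustrative Movement-Primitive-style sketch; the bandwidth heuristic
    and normalization are simplifying assumptions, not the paper's method.
    """
    weights = np.asarray(weights, dtype=float)
    n_basis = len(weights)
    t = np.linspace(0.0, 1.0, n_steps)          # normalized phase in [0, 1]
    centers = np.linspace(0.0, 1.0, n_basis)    # evenly spaced basis centers
    width = 1.0 / n_basis**2                    # heuristic bandwidth
    # Gaussian basis activations, one row per time step.
    phi = np.exp(-0.5 * (t[:, None] - centers[None, :]) ** 2 / width)
    phi /= phi.sum(axis=1, keepdims=True)       # rows sum to 1
    return phi @ weights                        # shape: (n_steps,)

# Five weights parameterize an entire smooth 100-step action sequence:
traj = basis_trajectory([0.0, 0.5, 1.0, 0.5, 0.0])
```

Because ERL searches over the low-dimensional weight vector instead of 100 per-step actions, exploration noise applied to the weights perturbs the whole trajectory coherently while keeping it smooth.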
Oct-12-2024