Decision S4: Efficient Sequence-Based RL via State Spaces Layers
Shmuel Bar-David, Itamar Zimerman, Eliya Nachmani, Lior Wolf
Recently, sequence learning methods have been applied to the problem of off-policy Reinforcement Learning, including the seminal work on Decision Transformers, which employs transformers for this task. Since transformers are parameter-heavy, cannot benefit from history longer than a fixed window size, and are not computed using recurrence, we set out to investigate the suitability of the S4 family of models, which are based on state-space layers and have been shown to outperform transformers, especially in modeling long-range dependencies. In this work we present two main algorithms: (i) an off-policy training procedure that works with trajectories, while still maintaining the training efficiency of the S4 model; and (ii) an on-policy training procedure that is trained in a recurrent manner, works with longer episodes, and benefits from stable training. Our results indicate that our method outperforms multiple variants of decision transformers, as well as the other baseline methods, on most tasks, while reducing the latency, number of parameters, and training time by several orders of magnitude, making our approach more suitable for real-world RL.

Robots are naturally described as being in an observable state, having a multi-dimensional action space, and striving to achieve a measurable goal. The complexity of these three elements, and the often non-differentiable links between them, such as the transition between states given an action and the mapping from states to reward (with the latter computed based on additional entities), make the use of Reinforcement Learning (RL) natural; see also (Kober et al., 2013; Ibarz et al., 2021). Off-policy RL has favorable sample complexity and is widely used in robotics research, e.g., (Haarnoja et al., 2018; Gu et al., 2017). However, with the advent of accessible physical simulations for generating data, learning complex tasks without a successful sample model is readily approached by on-policy methods (Siekmann et al., 2021), and the same holds for the task of adversarial imitation learning (Peng et al., 2021; 2022).

The decision transformer of Chen et al. (2021) is a sequence-based off-policy RL method that considers sequences of tuples of the form (reward, state, action). Using the auto-regressive capability of transformers, it generates the next action given the desired reward and the current state. The major disadvantages of the decision transformer are the size of the architecture, which is a known limitation of these models; the inference runtime, which stems from the inability to compute the transformer recursively; and the fixed window size, which eliminates long-range dependencies. In this work, we propose a novel, sequence-based RL method that is far more efficient than the decision transformer and more suitable for capturing long-range effects. The method is based on the S4 sequence model, which was designed by Gu et al. (2021a).
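To make the efficiency argument concrete, here is a minimal sketch (not the authors' implementation) of a diagonal linear state-space layer run in its recurrent form: each incoming (reward, state, action) token updates a fixed-size hidden state in constant time per step, so the usable history is not bounded by a context window, unlike the decision transformer. The class name, initialization, and shapes below are illustrative assumptions.

```python
# Minimal sketch: a diagonal linear state-space layer in recurrent mode,
# illustrating O(1)-per-step inference over (reward, state, action) tokens.
# Names, shapes, and the simple discretization are assumptions for illustration.
import numpy as np

class DiagonalSSMLayer:
    """y_k = C x_k, with x_k = A_bar * x_{k-1} + B_bar * u_k (per channel)."""

    def __init__(self, d_model, d_state, dt=0.01, seed=0):
        rng = np.random.default_rng(seed)
        # Stable diagonal continuous-time dynamics, discretized with a
        # zero-order-hold-style step (a stand-in for S4's structured init).
        a = -0.5 + 1j * np.pi * np.arange(d_state)            # diagonal of A
        self.A_bar = np.exp(dt * a)                           # (d_state,)
        self.B_bar = (self.A_bar - 1.0) / a                   # (d_state,)
        self.C = rng.standard_normal((d_model, d_state)) / np.sqrt(d_state)
        self.x = np.zeros((d_model, d_state), dtype=complex)  # hidden state

    def step(self, u):
        """Consume one token embedding u of shape (d_model,) in O(d_model * d_state)."""
        self.x = self.A_bar * self.x + self.B_bar * u[:, None]
        return np.einsum("ds,ds->d", self.C, self.x).real

# Toy rollout: embed each (reward, state, action) tuple into one token and
# stream it through the layer; memory does not grow with episode length.
d_model, episode_len = 8, 1000
layer = DiagonalSSMLayer(d_model, d_state=16)
for k in range(episode_len):
    token = np.random.randn(d_model)   # stand-in for an embedded (R, s, a) tuple
    features = layer.step(token)       # would feed an action head in a full model
print(features.shape)                  # (d_model,) regardless of episode length
```

In a full model, the per-step features would feed an action head, and S4-style layers additionally admit an equivalent convolutional form for efficient parallel training; this sketch only illustrates the recurrent inference path.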
arXiv.org Artificial Intelligence
Jun-8-2023