Heim, Steve
FLD: Fourier Latent Dynamics for Structured Motion Representation and Learning
Li, Chenhao, Stanger-Jones, Elijah, Heim, Steve, Kim, Sangbae
Motion trajectories offer reliable references for physics-based motion learning but suffer from sparsity, particularly in regions that lack sufficient data coverage. To address this challenge, we introduce a self-supervised, structured representation and generation method that extracts spatial-temporal relationships in periodic or quasi-periodic motions. Modeling the motion dynamics in a continuously parameterized latent space enables our method to enhance the interpolation and generalization capabilities of motion learning algorithms. The motion learning controller, informed by the motion parameterization, performs online tracking of a wide range of motions, including targets unseen during training. With a fallback mechanism, the controller dynamically adapts its tracking strategy and automatically resorts to safe action execution when a potentially risky target is proposed. By leveraging the identified spatial-temporal structure, our work opens new possibilities for future advancements in general motion representation and learning algorithms.

The availability of reference trajectories, such as motion capture data, has significantly propelled the advancement of motion learning techniques (Peng et al., 2018; Bergamin et al., 2019; Peng et al., 2021; 2022; Starke et al., 2022; Li et al., 2023b;a). However, policies trained with these techniques generalize poorly to motions outside the distribution of the available data (Peng et al., 2020; Li et al., 2023a). A core reason is that, while the trajectories in the data are induced by some underlying dynamics of the system, the learned policies are typically trained only to replicate the data rather than to capture that dynamics structure. In other words, the policies attempt to memorize trajectory instances rather than learn to predict them systematically. Moreover, the high nonlinearity and the embedded high-level similarity of motions hinder data-driven methods from effectively identifying and modeling the dynamics of motion patterns (Peng et al., 2018). Addressing these challenges therefore requires systematically understanding and leveraging the structured nature of the motion space.

Instead of handling raw motion trajectories in a long-horizon, high-dimensional state space, structured representation methods introduce certain inductive biases during training and offer an efficient approach to managing complex movements (Min & Chai, 2012; Lee et al., 2021). These methods focus on extracting the essential features and temporal dependencies of motions, enabling more effective and compact representations (Lee et al., 2010; Levine et al., 2012). The ability to capture the spatial-temporal structure of the motion space offers enhanced interpolation and generalization capabilities that can augment training datasets and improve the effectiveness of motion generation algorithms (Holden et al., 2017; Iscen et al., 2018; Ibarz et al., 2021).
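Below is a minimal sketch of the kind of Fourier-based periodic parameterization described above, assuming a latent trajectory window is already available; the function names, the plain FFT-based extraction, and the phase-advance step are illustrative assumptions, not the exact FLD architecture.

import numpy as np

def fourier_parameterize(latent_window, dt):
    # latent_window: (T, C) array, one quasi-periodic latent trajectory per channel.
    T, C = latent_window.shape
    spectrum = np.fft.rfft(latent_window, axis=0)       # one-sided spectrum per channel
    freqs = np.fft.rfftfreq(T, d=dt)
    peak = np.argmax(np.abs(spectrum[1:]), axis=0) + 1  # dominant non-DC bin per channel
    frequency = freqs[peak]                             # (C,) dominant frequency
    amplitude = 2.0 * np.abs(spectrum[peak, np.arange(C)]) / T
    phase = np.angle(spectrum[peak, np.arange(C)])
    offset = np.real(spectrum[0]) / T                   # per-channel mean
    return frequency, amplitude, offset, phase

def advance_phase(phase, frequency, dt):
    # Latent dynamics for periodic motion: only the phase evolves;
    # frequency, amplitude, and offset stay (quasi-)constant.
    return np.mod(phase + 2.0 * np.pi * frequency * dt, 2.0 * np.pi)

def reconstruct(frequency, amplitude, offset, phase):
    # Evaluate the parameterized latent state at the current phase.
    return offset + amplitude * np.cos(phase)

A downstream tracking policy can then be conditioned on (frequency, amplitude, offset, phase) rather than on raw trajectory frames, which is what allows interpolation between motions in the parameterization space.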
Learning Emergent Gaits with Decentralized Phase Oscillators: on the role of Observations, Rewards, and Feedback
Zhang, Jenny, Heim, Steve, Jeon, Se Hwan, Kim, Sangbae
We present a minimal phase oscillator model for learning quadrupedal locomotion. Each of the four oscillators is coupled only to itself and its corresponding leg through local feedback of the ground reaction force, which can be interpreted as an observer feedback gain. We interpret the oscillator itself as a latent contact-state estimator. Through a systematic ablation study, we show that the combination of phase observations, simple phase-based rewards, and the local feedback dynamics induces policies that exhibit emergent gait preferences, while using a reduced set of simple rewards and without prescribing a specific gait. The code is open-source, and a video synopsis is available at https://youtu.be/1NKQ0rSV3jU.
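For concreteness, here is a minimal sketch of a decentralized phase oscillator with local ground-reaction-force feedback; the specific feedback form and gain are illustrative assumptions rather than the exact model used in the paper.

import numpy as np

def oscillator_step(phase, grf, dt, omega=2.0 * np.pi, k=1.0):
    # phase: (4,) oscillator phases, one per leg
    # grf:   (4,) normalized ground reaction forces measured at each foot
    # omega: nominal stepping frequency [rad/s]; k: local feedback gain
    # Each oscillator is driven by its nominal frequency plus a correction that
    # depends only on its own leg's contact force, acting like an observer update
    # that pulls the latent contact estimate toward the measured contact state.
    phase_dot = omega + k * grf * np.sin(phase)
    return np.mod(phase + phase_dot * dt, 2.0 * np.pi)

The per-leg phases are then exposed to the policy as observations, and simple phase-based rewards close the loop between the oscillator state and the learned controller.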
Benchmarking Potential Based Rewards for Learning Humanoid Locomotion
Jeon, Se Hwan, Heim, Steve, Khazoom, Charles, Kim, Sangbae
The main challenge in developing effective reinforcement learning (RL) pipelines is often designing and tuning the reward functions. A well-designed shaping reward can lead to significantly faster learning. Naively formulated rewards, however, can conflict with the desired behavior and result in overfitting, or even erratic performance, if not properly tuned. In theory, the broad class of potential-based reward shaping (PBRS) can help guide the learning process without affecting the optimal policy. Although several studies have explored the use of PBRS to accelerate learning convergence, most have been limited to grid worlds and low-dimensional systems, and RL in robotics has predominantly relied on standard forms of reward shaping. In this paper, we benchmark standard forms of shaping against PBRS for a humanoid robot. We find that in this high-dimensional system, PBRS offers only marginal benefits in convergence speed. However, the PBRS reward terms are significantly more robust to scaling than typical reward shaping approaches, and are thus easier to tune.
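As a reference point, the following sketch shows the standard potential-based shaping term and one hypothetical way of attaching it to a task reward; the potential and the velocity-tracking example are illustrative, not the reward terms benchmarked in the paper.

def pbrs_term(phi_s, phi_s_next, gamma):
    # Potential-based shaping F(s, s') = gamma * Phi(s') - Phi(s); adding F to the
    # task reward provably leaves the optimal policy unchanged (Ng et al., 1999).
    return gamma * phi_s_next - phi_s

def shaped_reward(task_reward, state, next_state, gamma=0.99):
    # Hypothetical potential: negative velocity-tracking error of the robot base.
    phi = lambda s: -abs(s["base_velocity"] - s["target_velocity"])
    return task_reward + pbrs_term(phi(state), phi(next_state), gamma)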
Safe Value Functions
Massiani, Pierre-François, Heim, Steve, Solowjow, Friedrich, Trimpe, Sebastian
Safety constraints and optimality are important, but sometimes conflicting, criteria for controllers. Although these criteria are often addressed separately with different tools to maintain formal guarantees, it is also common practice in reinforcement learning to simply modify reward functions by penalizing failures, with the penalty treated as a mere heuristic. We rigorously examine the relationship of both safety and optimality to penalties, and formalize sufficient conditions for safe value functions (SVFs): value functions that are both optimal for a given task and enforce safety constraints. We reveal this structure by examining when rewards preserve viability under optimal control, and show that there always exists a finite penalty that induces a safe value function. This penalty is not unique, but upper-unbounded: larger penalties do not harm optimality. Although it is often not possible to compute the minimum required penalty, we reveal a clear structure of how the penalty, rewards, discount factor, and dynamics interact. This insight suggests practical, theory-guided heuristics for designing reward functions for control problems where safety is important.
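The reward modification under study can be written compactly; the sketch below is a minimal illustration with a scalar failure penalty, where the failure indicator and penalty value are placeholders rather than quantities prescribed by the paper.

def penalized_reward(task_reward, failed, penalty):
    # Subtract a constant penalty whenever the system reaches a failure state.
    # The analysis shows a finite (but not unique) penalty exists above which the
    # resulting optimal value function is a safe value function: optimal for the
    # task and safety-enforcing, with larger penalties not harming optimality.
    return task_reward - penalty if failed else task_reward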
A Learnable Safety Measure
Heim, Steve, von Rohr, Alexander, Trimpe, Sebastian, Badri-Spröwitz, Alexander
Failures are challenging for learning to control physical systems, since they risk damage, incur time-consuming resets, and often provide little gradient information. Adding safety constraints to exploration typically requires a lot of prior knowledge and domain expertise. We present a safety measure which implicitly captures how the system dynamics relate to a set of failure states. Not only can this measure be used as a safety function, but it can also be used to directly compute the set of safe state-action pairs. Further, we show a model-free approach to learning this measure by active sampling using Gaussian processes. While safety can only be guaranteed after the safety measure has been learned, we show that failures can already be greatly reduced by using the estimated measure during learning.
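A minimal sketch of what learning such a measure by active sampling could look like, using a Gaussian process over state-action pairs; the kernel, the viability labels, and the acquisition rule below are illustrative assumptions, not the paper's exact algorithm.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def estimate_safety_measure(state_actions, outcomes, candidates, beta=2.0):
    # state_actions: (N, D) sampled state-action pairs
    # outcomes: (N,) 1.0 for rollouts that stayed viable, 0.0 for failures
    # candidates: (M, D) state-action pairs to evaluate next
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5))
    gp.fit(state_actions, outcomes)
    mean, std = gp.predict(candidates, return_std=True)
    safe_mask = (mean - beta * std) > 0.5          # conservative estimate of the safe set
    # Active sampling: among (estimated) safe candidates, query the most uncertain one.
    next_idx = np.argmax(np.where(safe_mask, std, -np.inf))
    return mean, safe_mask, candidates[next_idx]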
Learning from Outside the Viability Kernel: Why we Should Build Robots that can Fall with Grace
Heim, Steve, Spröwitz, Alexander
Despite impressive results using reinforcement learning to solve complex problems from scratch, in robotics this has still been largely limited to model-based learning with very informative reward functions. One of the major challenges is that the reward landscape often has large patches with no gradient, making it difficult to sample gradients effectively. We show here that the robot's state initialization can have a more important effect on the reward landscape than is generally expected. In particular, we show the counter-intuitive benefit of including initializations that are unviable, in other words, initializing in states that are doomed to fail.
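A minimal sketch of the initialization strategy this suggests, assuming simple box bounds on an estimated viable region and deliberately sampling beyond them; the bounds and margin are placeholders.

import numpy as np

def sample_initial_state(rng, viable_low, viable_high, margin=0.3):
    # Widen the sampling box beyond the estimated viability kernel so that some
    # episodes start in states that are doomed to fail; these rollouts still
    # shape the reward landscape near the kernel boundary.
    span = viable_high - viable_low
    return rng.uniform(viable_low - margin * span, viable_high + margin * span)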