Riedmiller, Martin
Preference Optimization as Probabilistic Inference
Abdolmaleki, Abbas, Piot, Bilal, Shahriari, Bobak, Springenberg, Jost Tobias, Hertweck, Tim, Joshi, Rishabh, Oh, Junhyuk, Bloesch, Michael, Lampe, Thomas, Heess, Nicolas, Buchli, Jonas, Riedmiller, Martin
The use of preference-annotated data for training machine learning models has a long history, going back to early algorithms for recommender systems and market research (Bonilla et al., 2010; Boutilier, 2002; Guo and Sanner, 2010). These days, preference optimization algorithms are receiving renewed attention since they are a natural candidate for shaping the outputs of deep learning systems, such as large language models (Ouyang et al., 2022; Team et al., 2024) or control policies, via human feedback (Azar et al., 2023; Christiano et al., 2017; Rafailov et al., 2023). Arguably, preference optimization algorithms can also be a natural choice even when direct human feedback is not available but one instead aims to optimize a machine learning model based on feedback from a hand-coded or learned critic function (judging the desirability of solutions). Here, preference optimization methods are useful since they let us optimize the model to achieve desired outcomes based on relative rankings between outcomes alone (rather than requiring absolute labels or carefully crafted reward functions). Among preference optimization approaches, those based on directly using preference data - as opposed to casting preference optimization as reinforcement learning from (human) feedback - such as DPO (Rafailov et al., 2023), have emerged as particularly successful since they only require access to an offline dataset of paired preference data and are fairly robust to application domain and hyperparameter settings. However, algorithms within this class make specific assumptions tailored to their application domain. They were designed to optimize LLMs from human feedback in the form of comparisons of generated sentences and thus, by design, require paired preference data (since they directly model a specific choice of preference distribution). We are interested in finding algorithms that are more flexible and applicable in settings where the assumptions underlying DPO do not apply.
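To make the paired-data requirement concrete, here is a minimal sketch of a DPO-style objective in PyTorch. The loss is only defined over (chosen, rejected) completion pairs for the same prompt, which is exactly the assumption the paper seeks to relax; the function signature and the value of beta are illustrative, not taken from any specific implementation.

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style objective on a batch of *paired* completions.

    Each argument is a tensor of summed token log-probabilities of the
    chosen / rejected completion under the trained policy or a frozen
    reference model. Without the pairing, the loss cannot be computed.
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    # Bradley-Terry preference likelihood, maximized via log-sigmoid.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```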
Learning Robot Soccer from Egocentric Vision with Deep Reinforcement Learning
Tirumala, Dhruva, Wulfmeier, Markus, Moran, Ben, Huang, Sandy, Humplik, Jan, Lever, Guy, Haarnoja, Tuomas, Hasenclever, Leonard, Byravan, Arunkumar, Batchelor, Nathan, Sreendra, Neil, Patel, Kushal, Gwira, Marlon, Nori, Francesco, Riedmiller, Martin, Heess, Nicolas
We apply multi-agent deep reinforcement learning (RL) to train end-to-end robot soccer policies with fully onboard computation and sensing via egocentric RGB vision. This setting reflects many challenges of real-world robotics, including active perception, agile full-body control, and long-horizon planning in a dynamic, partially observable, multi-agent domain. We rely on large-scale, simulation-based data generation to obtain complex behaviors from egocentric vision which can be successfully transferred to physical robots using low-cost sensors. To achieve adequate visual realism, our simulation combines rigid-body physics with learned, realistic rendering via multiple Neural Radiance Fields (NeRFs). We combine teacher-based multi-agent RL and cross-experiment data reuse to enable the discovery of sophisticated soccer strategies. We analyze active-perception behaviors, including object tracking and ball seeking, that emerge when simply optimizing perception-agnostic soccer play. The agents display levels of performance and agility comparable to policies with access to privileged, ground-truth state. To our knowledge, this paper constitutes a first demonstration of end-to-end training for multi-agent robot soccer, mapping raw pixel observations to joint-level actions, that can be deployed in the real world. Videos of the game-play and analyses can be seen on our website: https://sites.google.com/view/vision-soccer
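The combination of rigid-body physics with learned rendering can be pictured with a short, purely hypothetical sketch: a physics engine produces robot and ball poses, and a per-scene NeRF turns the head camera pose into an egocentric RGB image. None of these classes or method names correspond to a released API.

```python
class EgocentricSimulator:
    """Hypothetical sketch of a hybrid simulator: rigid-body dynamics for
    state, learned NeRF renderers for realistic egocentric pixels."""

    def __init__(self, physics, nerf_renderers):
        self.physics = physics            # rigid-body simulator (assumed interface)
        self.renderers = nerf_renderers   # one learned scene model per environment

    def step(self, joint_action, scene_id=0):
        self.physics.step(joint_action)                   # advance dynamics
        head_pose = self.physics.camera_pose("head")      # egocentric camera extrinsics
        rgb = self.renderers[scene_id].render(head_pose)  # learned, realistic rendering
        return rgb, self.physics.proprioception()         # pixels + proprioceptive state
```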
Real-World Fluid Directed Rigid Body Control via Deep Reinforcement Learning
Bhardwaj, Mohak, Lampe, Thomas, Neunert, Michael, Romano, Francesco, Abdolmaleki, Abbas, Byravan, Arunkumar, Wulfmeier, Markus, Riedmiller, Martin, Buchli, Jonas
Recent advances in real-world applications of reinforcement learning (RL) have relied on the ability to accurately simulate systems at scale. However, domains such as fluid dynamical systems exhibit complex dynamic phenomena that are hard to simulate at high integration rates, limiting the direct application of modern deep RL algorithms to often expensive or safety-critical hardware. In this work, we introduce "Box o Flows", a novel benchtop experimental control system for systematically evaluating RL algorithms in dynamic real-world scenarios. We describe the key components of the Box o Flows, and through a series of experiments demonstrate how state-of-the-art model-free RL algorithms can synthesize a variety of complex behaviors via simple reward specifications. Furthermore, we explore the role of offline RL in data-efficient hypothesis testing by reusing past experiences. We believe that the insights gained from this preliminary study and the availability of systems like the Box o Flows help pave the way for developing systematic RL algorithms that can be generally applied to complex dynamical systems. Supplementary material and videos of experiments are available at https://sites.google.com/view/box-o-flows/home.
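As an illustration of what a "simple reward specification" could look like in such a setting, here is a hedged sketch; the task, the observation field, and the thresholds are invented for illustration and are not the actual Box o Flows interface.

```python
def hover_reward(obs, target_height=0.3, tolerance=0.05):
    """Hypothetical sparse reward: keep a tracked object (e.g. a ball held
    up by the airflow) near a target height.

    `obs["object_z"]` is an assumed observation field, not a documented one.
    """
    error = abs(obs["object_z"] - target_height)
    return 1.0 if error < tolerance else 0.0
```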
Offline Actor-Critic Reinforcement Learning Scales to Large Models
Springenberg, Jost Tobias, Abdolmaleki, Abbas, Zhang, Jingwei, Groth, Oliver, Bloesch, Michael, Lampe, Thomas, Brakel, Philemon, Bechtle, Sarah, Kapturowski, Steven, Hafner, Roland, Heess, Nicolas, Riedmiller, Martin
We show that offline actor-critic reinforcement learning can scale to large models - such as transformers - and follows scaling laws similar to those of supervised learning. We find that offline actor-critic algorithms can outperform strong, supervised, behavioral cloning baselines for multi-task training on a large dataset containing both sub-optimal and expert behavior on 132 continuous control tasks. We introduce a Perceiver-based actor-critic model and elucidate the key model features needed to make offline RL work with self- and cross-attention modules. Overall, we find that: i) simple offline actor-critic algorithms are a natural choice for gradually moving away from the currently predominant paradigm of behavioral cloning, and ii) via offline RL it is possible to learn multi-task policies that master many domains simultaneously, including real robotics tasks, from sub-optimal demonstrations or self-generated data.
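For readers unfamiliar with this family of methods, here is a minimal PyTorch sketch of a generic offline actor-critic update (a TD critic loss plus an advantage-weighted policy loss). It illustrates the general recipe only; it is not the paper's exact objective or its Perceiver-based architecture, and the `policy`/`q_net` interfaces are assumptions.

```python
import torch
import torch.nn.functional as F

def offline_actor_critic_losses(policy, q_net, target_q_net, batch,
                                discount=0.99, temperature=1.0):
    """Generic offline actor-critic update on a batch of logged transitions.

    `policy(obs)` is assumed to return a distribution whose `log_prob` sums
    over action dimensions; `q_net(obs, act)` returns one value per sample.
    """
    obs, act, rew, next_obs, done = (batch[k] for k in
                                     ("obs", "act", "rew", "next_obs", "done"))
    # Critic: one-step TD target with actions sampled from the current policy.
    with torch.no_grad():
        next_act = policy(next_obs).sample()
        target = rew + discount * (1.0 - done) * target_q_net(next_obs, next_act)
    critic_loss = F.mse_loss(q_net(obs, act), target)

    # Actor: weight the logged actions by their (clipped) exponentiated
    # advantage, which keeps the policy close to the data distribution.
    with torch.no_grad():
        value = q_net(obs, policy(obs).sample())
        weight = torch.clamp(torch.exp((q_net(obs, act) - value) / temperature),
                             max=20.0)
    actor_loss = -(weight * policy(obs).log_prob(act)).mean()
    return critic_loss, actor_loss
```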
RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
Bousmalis, Konstantinos, Vezzani, Giulia, Rao, Dushyant, Devin, Coline, Lee, Alex X., Bauza, Maria, Davchev, Todor, Zhou, Yuxiang, Gupta, Agrim, Raju, Akhil, Laurens, Antoine, Fantacci, Claudio, Dalibard, Valentin, Zambelli, Martina, Martins, Murilo, Pevceviciute, Rugile, Blokzijl, Michiel, Denil, Misha, Batchelor, Nathan, Lampe, Thomas, Parisotto, Emilio, Żołna, Konrad, Reed, Scott, Colmenarejo, Sergio Gómez, Scholz, Jon, Abdolmaleki, Abbas, Groth, Oliver, Regli, Jean-Baptiste, Sushkov, Oleg, Rothörl, Tom, Chen, José Enrique, Aytar, Yusuf, Barker, Dave, Ortiz, Joy, Riedmiller, Martin, Springenberg, Jost Tobias, Hadsell, Raia, Nori, Francesco, Heess, Nicolas
The ability to leverage heterogeneous robotic experience from different robots and tasks to quickly master novel skills and embodiments has the potential to transform robot learning. Inspired by recent advances in foundation models for vision and language, we propose a multi-embodiment, multi-task generalist agent for robotic manipulation. This agent, named RoboCat, is a visual goal-conditioned decision transformer capable of consuming action-labelled visual experience. This data spans a large repertoire of motor control skills from simulated and real robotic arms with varying sets of observations and actions. With RoboCat, we demonstrate the ability to generalise to new tasks and robots, both zero-shot and through adaptation using only 100-1000 examples for the target task. We also show how a trained model itself can be used to generate data for subsequent training iterations, thus providing a basic building block for an autonomous improvement loop. We investigate the agent's capabilities with large-scale evaluations both in simulation and on three different real robot embodiments. We find that as we grow and diversify its training data, RoboCat not only shows signs of cross-task transfer, but also becomes more efficient at adapting to new tasks.
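The self-improvement loop can be summarized in a few lines of pseudocode-style Python; every method name here is a placeholder, not a released API.

```python
def self_improvement_loop(agent, tasks, dataset, n_rounds=3, episodes_per_task=500):
    """Hypothetical outline of a RoboCat-style improvement cycle: fine-tune on
    a small amount of data per new task, let the fine-tuned agents generate
    more data, then retrain the generalist on the enlarged dataset."""
    for _ in range(n_rounds):
        # 1) Adapt to each new task from only 100-1000 demonstrations.
        finetuned = {task: agent.finetune(dataset.demos(task)) for task in tasks}
        # 2) Deploy the fine-tuned agents to self-generate experience.
        for task, ft_agent in finetuned.items():
            dataset.add(task, ft_agent.collect_episodes(task, episodes_per_task))
        # 3) Retrain the generalist agent on the grown, diversified dataset.
        agent = agent.train(dataset.all())
    return agent
```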
Mastering Stacking of Diverse Shapes with Large-Scale Iterative Reinforcement Learning on Real Robots
Lampe, Thomas, Abdolmaleki, Abbas, Bechtle, Sarah, Huang, Sandy H., Springenberg, Jost Tobias, Bloesch, Michael, Groth, Oliver, Hafner, Roland, Hertweck, Tim, Neunert, Michael, Wulfmeier, Markus, Zhang, Jingwei, Nori, Francesco, Heess, Nicolas, Riedmiller, Martin
Reinforcement learning solely from an agent's self-generated data is often believed to be infeasible for learning on real robots, due to the amount of data needed. However, if done right, agents learning from real data can be surprisingly efficient by re-using previously collected sub-optimal data. In this paper we demonstrate how the increased understanding of off-policy learning methods and their embedding in an iterative online/offline scheme ('collect and infer') can drastically improve data-efficiency by using all the collected experience, enabling learning from real-robot experience alone. Moreover, the resulting policy improves significantly over the state of the art on a recently proposed real robot manipulation benchmark. Our approach learns end-to-end, directly from pixels, and does not rely on additional human domain knowledge such as a simulator or demonstrations.
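A hedged sketch of such an iterative online/offline ('collect and infer') scheme is given below; the interfaces are illustrative placeholders, not the paper's code.

```python
def collect_and_infer(agent, env, replay, n_iterations=10,
                      episodes_per_iteration=1000, offline_steps=100_000):
    """Alternate between collecting data with the current policy and
    (re)training off-policy on *all* experience gathered so far.
    `agent`, `env`, and `replay` are assumed interfaces."""
    for _ in range(n_iterations):
        # Collect: run the current policy on the robot and keep everything,
        # including failed and sub-optimal episodes.
        for _ in range(episodes_per_iteration):
            replay.add_episode(agent.collect_episode(env))
        # Infer: train offline on the full accumulated dataset.
        for _ in range(offline_steps):
            agent.update(replay.sample_batch())
    return agent
```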
Less is more -- the Dispatcher/ Executor principle for multi-task Reinforcement Learning
Riedmiller, Martin, Hertweck, Tim, Hafner, Roland
Humans instinctively know how to neglect details when it comes to solving complex decision-making problems in environments with unforeseeable variations. This abstraction process seems to be a vital property for most biological systems and helps to 'abstract away' unnecessary details and boost generalisation. In this work we introduce the dispatcher/executor principle for the design of multi-task Reinforcement Learning controllers. It suggests partitioning the controller into two entities, one that understands the task (the dispatcher) and one that computes the controls for the specific device (the executor) - and connecting these two by a strongly regularizing communication channel. The core rationale behind this position paper is that changes in structure and design principles can improve generalisation properties and drastically improve data-efficiency. It is in some sense a 'yes, and ...' response to the current trend of using large neural networks trained on vast amounts of data and betting on emerging generalisation properties. While we agree on the power of scaling - in the sense of Sutton's 'bitter lesson' - we give some evidence that considering structure and adding design principles can be a valuable and critical component, in particular when data is not abundant but a precious resource. A video showing the results can be found at https://sites.google.com/view/dispatcher-executor.
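As a toy illustration of the principle, a dispatcher and an executor connected through a narrow, regularized channel could look as follows in PyTorch; the layer sizes and the choice of regularization (injecting noise into the channel) are assumptions for illustration, not the paper's design.

```python
import torch
import torch.nn as nn

class DispatcherExecutorPolicy(nn.Module):
    """Toy sketch: the dispatcher sees task-level inputs and emits a small
    command; the executor maps that command plus proprioception to controls.
    The narrow, regularized communication channel is the key design choice."""

    def __init__(self, task_dim, proprio_dim, action_dim, channel_dim=8):
        super().__init__()
        self.dispatcher = nn.Sequential(
            nn.Linear(task_dim, 128), nn.ReLU(), nn.Linear(128, channel_dim))
        self.executor = nn.Sequential(
            nn.Linear(channel_dim + proprio_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim))

    def forward(self, task_obs, proprio_obs):
        command = self.dispatcher(task_obs)
        if self.training:  # simple channel regularization (illustrative choice)
            command = command + 0.1 * torch.randn_like(command)
        return self.executor(torch.cat([command, proprio_obs], dim=-1))
```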
Replay across Experiments: A Natural Extension of Off-Policy RL
Tirumala, Dhruva, Lampe, Thomas, Chen, Jose Enrique, Haarnoja, Tuomas, Huang, Sandy, Lever, Guy, Moran, Ben, Hertweck, Tim, Hasenclever, Leonard, Riedmiller, Martin, Heess, Nicolas, Wulfmeier, Markus
Replaying data is a principal mechanism underlying the stability and data efficiency of off-policy reinforcement learning (RL). We present an effective yet simple framework to extend the use of replays across multiple experiments, minimally adapting the RL workflow for sizeable improvements in controller performance and research iteration times. At its core, Replay Across Experiments (RaE) involves reusing experience from previous experiments to improve exploration and bootstrap learning while reducing required changes to a minimum in comparison to prior work. We empirically show benefits across a number of RL algorithms and challenging control domains spanning both locomotion and manipulation, including hard exploration tasks from egocentric vision. Through comprehensive ablations, we demonstrate robustness to the quality and amount of data available and various hyperparameter choices. Finally, we discuss how our approach can be applied more broadly across research life cycles and can increase resilience by reloading data across random seeds or hyperparameter variations.
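A minimal sketch of how data from earlier experiments might be reloaded and mixed with freshly collected transitions is given below; the fixed mixing ratio and the buffer interfaces are assumptions for illustration only.

```python
import random

class MixedReplay:
    """Hypothetical replay wrapper for RaE-style reuse: seed training with
    transitions from previous experiments and mix them with new online data."""

    def __init__(self, prior_datasets, online_buffer, prior_fraction=0.5):
        self.prior = [t for d in prior_datasets for t in d]  # flatten old runs
        self.online = online_buffer                          # assumed interface
        self.prior_fraction = prior_fraction

    def sample_batch(self, batch_size):
        n_prior = int(batch_size * self.prior_fraction)
        batch = random.sample(self.prior, n_prior)
        batch += self.online.sample(batch_size - n_prior)
        return batch
```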
Real Robot Challenge 2022: Learning Dexterous Manipulation from Offline Data in the Real World
Gürtler, Nico, Widmaier, Felix, Sancaktar, Cansu, Blaes, Sebastian, Kolev, Pavel, Bauer, Stefan, Wüthrich, Manuel, Wulfmeier, Markus, Riedmiller, Martin, Allshire, Arthur, Wang, Qiang, McCarthy, Robert, Kim, Hangyeol, Baek, Jongchan, Kwon, Wookyong, Qian, Shanliang, Toshimitsu, Yasunori, Michelis, Mike Yan, Kazemipour, Amirhossein, Raayatsanati, Arman, Zheng, Hehui, Cangan, Barnabas Gavin, Schölkopf, Bernhard, Martius, Georg
Experimentation on real robots is demanding in terms of time and cost. For this reason, a large part of the reinforcement learning (RL) community uses simulators to develop and benchmark algorithms. However, insights gained in simulation do not necessarily translate to real robots, in particular for tasks involving complex interactions with the environment. The Real Robot Challenge 2022 therefore served as a bridge between the RL and robotics communities by allowing participants to experiment remotely with a real robot - as easily as in simulation. In recent years, offline reinforcement learning has matured into a promising paradigm for learning from pre-collected datasets, alleviating the reliance on expensive online interactions. We therefore asked the participants to learn two dexterous manipulation tasks involving pushing, grasping, and in-hand orientation from provided real-robot datasets. Extensive software documentation and an initial stage based on a simulation of the real set-up made the competition particularly accessible. By giving each team a generous access budget to evaluate their offline-learned policies on a cluster of seven identical real TriFinger platforms, we organized an exciting competition for machine learners and roboticists alike. In this work we state the rules of the competition, present the methods used by the winning teams, and compare their results with a benchmark of state-of-the-art offline RL algorithms on the challenge datasets.
Towards practical reinforcement learning for tokamak magnetic control
Tracey, Brendan D., Michi, Andrea, Chervonyi, Yuri, Davies, Ian, Paduraru, Cosmin, Lazic, Nevena, Felici, Federico, Ewalds, Timo, Donner, Craig, Galperti, Cristian, Buchli, Jonas, Neunert, Michael, Huber, Andrea, Evens, Jonathan, Kurylowicz, Paula, Mankowitz, Daniel J., Riedmiller, Martin, The TCV Team
Reinforcement learning (RL) has shown promising results for real-time control systems, including the domain of plasma magnetic control. However, there are still significant drawbacks compared to traditional feedback control approaches for magnetic confinement. In this work, we address key drawbacks of the RL method: achieving higher control accuracy for desired plasma properties, reducing the steady-state error, and decreasing the required time to learn new tasks. We build on top of Degrave et al. (2022) and present algorithmic improvements to the agent architecture and training procedure. We present simulation results that show up to 65% improvement in shape accuracy, achieve a substantial reduction in the long-term bias of the plasma current, and additionally reduce the training time required to learn new tasks by a factor of 3 or more. We present new experiments using the upgraded RL-based controllers on the TCV tokamak, which validate the simulation results and point the way towards routinely achieving accurate discharges using the RL approach.