Sferrazza, Carmelo
MuJoCo Playground
Zakka, Kevin, Tabanpour, Baruch, Liao, Qiayuan, Haiderbhai, Mustafa, Holt, Samuel, Luo, Jing Yuan, Allshire, Arthur, Frey, Erik, Sreenath, Koushil, Kahrs, Lueder A., Sferrazza, Carmelo, Tassa, Yuval, Abbeel, Pieter
We introduce MuJoCo Playground, a fully open-source framework for robot learning built with MJX, with the express goal of streamlining simulation, training, and sim-to-real transfer onto robots. With a simple "pip install playground", researchers can train policies in minutes on a single GPU. Playground supports diverse robotic platforms, including quadrupeds, humanoids, dexterous hands, and robotic arms, enabling zero-shot sim-to-real transfer from both state and pixel inputs. This is achieved through an integrated stack comprising a physics engine, batch renderer, and training environments. Along with video results, the entire framework is freely available at playground.mujoco.org.
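A minimal sketch of the intended workflow, assuming a brax-style batched environment API on top of MJX; the module, registry entry point, and environment id below are illustrative assumptions rather than the confirmed Playground interface:

```python
# Hypothetical usage sketch -- module, registry, and environment names
# are assumptions, not the verified Playground API.
import jax
import jax.numpy as jnp

from mujoco_playground import registry  # assumed entry point

env = registry.load("Go1JoystickFlatTerrain")  # assumed environment id

# Roll out one step of the MJX environment under JIT compilation.
rng = jax.random.PRNGKey(0)
state = jax.jit(env.reset)(rng)
action = jnp.zeros(env.action_size)
state = jax.jit(env.step)(state, action)
```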
Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding
Jones, Joshua, Mees, Oier, Sferrazza, Carmelo, Stachowicz, Kyle, Abbeel, Pieter, Levine, Sergey
Interacting with the world is a multi-sensory experience: achieving effective general-purpose interaction requires making use of all available modalities -- including vision, touch, and audio -- to fill in gaps from partial observation. For example, when vision is occluded while reaching into a bag, a robot should rely on its senses of touch and sound. However, state-of-the-art generalist robot policies are typically trained on large datasets to predict robot actions solely from visual and proprioceptive observations. In this work, we propose FuSe, a novel approach that enables finetuning visuomotor generalist policies on heterogeneous sensor modalities, for which large datasets are not readily available, by leveraging natural language as a common cross-modal grounding. We combine a multimodal contrastive loss with a sensory-grounded language generation loss to encode high-level semantics. In the context of robot manipulation, we show that FuSe enables performing challenging tasks that require reasoning jointly over modalities such as vision, touch, and sound in a zero-shot setting, such as multimodal prompting, compositional cross-modal prompting, and describing the objects it interacts with. We show that the same recipe is applicable to widely different generalist policies, including both diffusion-based generalist policies and large vision-language-action (VLA) models. Extensive experiments in the real world show that FuSe is able to increase success rates by over 20% compared to all considered baselines.
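A minimal sketch of the combined objective, assuming per-modality encoders and a language-model head; the tensor names and the InfoNCE formulation are illustrative assumptions, not the released FuSe implementation:

```python
# Sketch of a FuSe-style auxiliary objective: contrastive grounding of
# each sensor modality in language, plus sensory-grounded generation.
import torch
import torch.nn.functional as F

def contrastive_loss(modality_emb, text_emb, temperature=0.07):
    """InfoNCE between L2-normalized sensor and language embeddings."""
    modality_emb = F.normalize(modality_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = modality_emb @ text_emb.T / temperature
    targets = torch.arange(logits.shape[0])  # matched pairs on the diagonal
    return F.cross_entropy(logits, targets)

def fuse_auxiliary_loss(touch_emb, audio_emb, text_emb, lm_logits, caption_ids):
    # Ground each heterogeneous modality in the shared language space ...
    l_contrastive = (contrastive_loss(touch_emb, text_emb)
                     + contrastive_loss(audio_emb, text_emb))
    # ... and generate sensory-grounded language describing the scene.
    l_generation = F.cross_entropy(lm_logits.flatten(0, 1), caption_ids.flatten())
    return l_contrastive + l_generation
```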
MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization
Sukhija, Bhavya, Coros, Stelian, Krause, Andreas, Abbeel, Pieter, Sferrazza, Carmelo
Reinforcement learning (RL) algorithms aim to balance exploiting the current best strategy with exploring new options that could lead to higher rewards. Most common RL algorithms use undirected exploration, i.e., they select random sequences of actions. Exploration can also be directed using intrinsic rewards, such as curiosity or model epistemic uncertainty. However, effectively balancing task and intrinsic rewards is challenging and often task-dependent. In this work, we introduce MaxInfoRL, a framework for balancing intrinsic and extrinsic exploration. MaxInfoRL steers exploration towards informative transitions by maximizing intrinsic rewards such as the information gain about the underlying task. When combined with Boltzmann exploration, this approach naturally trades off maximization of the value function with that of the entropy over states, rewards, and actions. We show that our approach achieves sublinear regret in the simplified setting of multi-armed bandits. We then apply this general formulation to a variety of off-policy model-free RL methods for continuous state-action spaces, yielding novel algorithms that achieve superior performance across hard exploration problems and complex scenarios such as visual control tasks.
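A schematic sketch of the resulting action selection, assuming an ensemble of dynamics models whose disagreement stands in for the information gain; the trade-off coefficient and temperature are illustrative assumptions:

```python
# Sketch of Boltzmann exploration over task value plus an intrinsic
# information-gain bonus, in the spirit of MaxInfoRL.
import numpy as np

def boltzmann_action(q_task, info_gain, beta=1.0, temperature=0.5):
    """Sample an action from a softmax over augmented action values."""
    q_aug = q_task + beta * info_gain       # extrinsic + intrinsic trade-off
    logits = q_aug / temperature            # Boltzmann exploration
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return np.random.choice(len(q_task), p=probs)

# Example: ensemble disagreement as a proxy for information gain.
q_task = np.array([1.0, 0.8, 0.2])          # task Q-values per action
ensemble_preds = np.random.randn(5, 3)      # 5 models x 3 actions
info_gain = ensemble_preds.std(axis=0)      # epistemic disagreement
action = boltzmann_action(q_task, info_gain)
```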
ActSafe: Active Exploration with Safety Constraints for Reinforcement Learning
As, Yarden, Sukhija, Bhavya, Treven, Lenart, Sferrazza, Carmelo, Coros, Stelian, Krause, Andreas
Reinforcement learning (RL) is ubiquitous in the development of modern AI systems. However, state-of-the-art RL agents require extensive, and potentially unsafe, interactions with their environments to learn effectively. These limitations confine RL agents to simulated environments, hindering their ability to learn directly in real-world settings. Despite notable progress, applying RL without any use of simulators remains largely limited. This is primarily because, in many cases, RL methods require massive amounts of data for learning while also being inherently unsafe during exploration. In many real-world settings, environments are complex and rarely align exactly with the assumptions made in simulators. Learning directly in the real world allows RL systems to close the sim-to-real gap and continuously adapt to evolving environments and distribution shifts. However, to unlock these advantages, RL algorithms must be sample-efficient and ensure safety throughout the learning process to avoid costly failures or risks in high-stakes applications. For instance, agents learning driving policies in autonomous vehicles must prevent collisions with other cars or pedestrians, even when adapting to new driving environments. This challenge is known as safe exploration, where the agent's exploration is restricted by safety-critical, often unknown, constraints that must be satisfied throughout the learning process. Several works study safe exploration and have demonstrated state-of-the-art performance in terms of both safety and sample efficiency for learning in the real world (Sui et al., 2015; Wischnewski et al., 2019; Berkenkamp et al., 2021; Cooper & Netoff, 2022; Sukhija et al., 2023; Widmer et al., 2023). These methods maintain a "safe set" of policies during learning, selecting policies from this set to safely explore and gradually expand it. Under common regularity assumptions about the constraints, these approaches guarantee safety throughout learning. However, explicitly maintaining and expanding a safe set limits these methods to low-dimensional policies, such as PID controllers. This makes them difficult to scale to more complex tasks such as those considered in deep RL. To this end, we propose a scalable model-based RL algorithm, ActSafe, for efficient and safe exploration. Crucially, ActSafe learns an uncertainty-aware dynamics model, which it uses to implicitly define and expand the safe set of policies.
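A minimal sketch of an implicit safe-set test under model uncertainty, assuming an ensemble dynamics model and a per-state constraint cost; the names and the pessimistic rollout check are illustrative, not the ActSafe implementation:

```python
# Sketch: a policy stays in the implicit safe set only if its worst-case
# rollout across the model ensemble respects the constraint budget.
import numpy as np

def pessimistic_constraint(policy, dynamics_ensemble, constraint_cost,
                           s0, horizon=20, budget=0.0):
    """Pessimistic admission test for the implicit safe set of policies."""
    worst = -np.inf
    for dynamics in dynamics_ensemble:       # each member captures one
        s, total = s0, 0.0                   # plausible world model
        for _ in range(horizon):
            a = policy(s)
            s = dynamics(s, a)
            total += constraint_cost(s)
        worst = max(worst, total)            # epistemic worst case
    return worst <= budget
```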
HumanoidBench: Simulated Humanoid Benchmark for Whole-Body Locomotion and Manipulation
Sferrazza, Carmelo, Huang, Dun-Ming, Lin, Xingyu, Lee, Youngwoon, Abbeel, Pieter
Humanoid robots hold great promise in assisting humans in diverse environments and tasks, due to the flexibility and adaptability afforded by their human-like morphology. However, research on humanoid robots is often bottlenecked by costly and fragile hardware setups. To accelerate algorithmic research on humanoid robots, we present HumanoidBench, a high-dimensional simulated robot learning benchmark featuring a humanoid robot equipped with dexterous hands and a variety of challenging whole-body manipulation and locomotion tasks. Our findings reveal that state-of-the-art reinforcement learning algorithms struggle with most tasks, whereas a hierarchical learning approach achieves superior performance when supported by robust low-level policies, such as walking or reaching. With HumanoidBench, we provide the robotics community with a platform to identify the challenges arising when solving diverse tasks with humanoid robots, facilitating prompt verification of algorithms and ideas. The open-source code is available at https://humanoid-bench.github.io.
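A hypothetical usage sketch following the standard Gymnasium pattern; the package import and task id are assumptions based on the project page rather than verified against the released code:

```python
# Hypothetical benchmark usage -- import name and task id are assumed.
import gymnasium as gym
import humanoid_bench  # assumed to register the benchmark tasks

env = gym.make("h1hand-walk-v0")  # assumed id: humanoid with dexterous hands
obs, info = env.reset(seed=0)
for _ in range(100):
    action = env.action_space.sample()  # random-policy baseline rollout
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
```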
Optical Tactile Sensing for Aerial Multi-Contact Interaction: Design, Integration, and Evaluation
Aucone, Emanuele, Sferrazza, Carmelo, Gregor, Manuel, D'Andrea, Raffaello, Mintchev, Stefano
Distributed tactile sensing for multi-force detection is crucial for various aerial robot interaction tasks. However, current contact sensing solutions on drones exploit only single end-effector sensors and cannot provide distributed multi-contact sensing. We propose an optical tactile sensor, designed to be easily mounted at the bottom of a drone, that features a large and curved soft sensing surface, a hollow structure, and a new illumination system. Our software pipeline detects multiple simultaneous contacts, even when spaced only 2 cm apart, and provides real-world quantities of 3D contact locations (in mm) and 3D force vectors (in N), with accuracies of 1.5 mm and 0.17 N, respectively. We demonstrate the sensor's applicability and reliability onboard and in real time with two demos: i) estimating the compliance of different perches and subsequently re-aligning and landing on the stiffer one, and ii) mapping sparse obstacles. The implementation of our distributed tactile sensor represents a significant step towards attaining the full potential of drones as versatile robots capable of interacting with and navigating within complex environments.
The Power of the Senses: Generalizable Manipulation from Vision and Touch through Masked Multimodal Learning
Sferrazza, Carmelo, Seo, Younggyo, Liu, Hao, Lee, Youngwoon, Abbeel, Pieter
Humans rely on the synergy of their senses for most essential tasks. For tasks requiring object manipulation, we seamlessly and effectively exploit the complementarity of our senses of vision and touch. This paper draws inspiration from such capabilities and aims to find a systematic approach to fuse visual and tactile information in a reinforcement learning setting. We propose Masked Multimodal Learning (M3L), which jointly learns a policy and visual-tactile representations based on masked autoencoding. The representations jointly learned from vision and touch improve sample efficiency, and unlock generalization capabilities beyond those achievable through each of the senses separately. Remarkably, representations learned in a multimodal setting also benefit vision-only policies at test time. We evaluate M3L on three simulated environments with both visual and tactile observations: robotic insertion, door opening, and dexterous in-hand manipulation, demonstrating the benefits of learning a multimodal policy. Code and videos of the experiments are available at https://sferrazza.cc/m3l_site.
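A sketch of the masked multimodal step at the core of M3L-style training, assuming vision and touch are tokenized into a single sequence before masked autoencoding; shapes and names are illustrative, not the released code:

```python
# Sketch: concatenate visual and tactile tokens, mask a random subset;
# the encoder sees only visible tokens, the decoder reconstructs all.
import torch

def mask_multimodal_tokens(vision_tokens, tactile_tokens, mask_ratio=0.75):
    """Return the visible tokens and their indices for masked autoencoding."""
    tokens = torch.cat([vision_tokens, tactile_tokens], dim=1)  # (B, N, D)
    batch, num_tokens, dim = tokens.shape
    num_keep = int(num_tokens * (1 - mask_ratio))
    scores = torch.rand(batch, num_tokens)            # random per-token scores
    keep = scores.argsort(dim=1)[:, :num_keep]        # indices to keep visible
    visible = torch.gather(
        tokens, 1, keep.unsqueeze(-1).expand(-1, -1, dim))
    return visible, keep
```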
Chain of Hindsight Aligns Language Models with Feedback
Liu, Hao, Sferrazza, Carmelo, Abbeel, Pieter
Learning from human preferences is important for language models to match human needs and to align with human and social values. Prior works have achieved remarkable successes by learning from human feedback to understand and follow instructions. Nonetheless, these methods are either founded on hand-picked model generations that are favored by human annotators, rendering them inefficient in terms of data utilization and challenging to apply in general, or they depend on reinforcement learning, which often suffers from imperfect reward functions and relies on extremely challenging optimizations. In this work, we propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity. Our idea is inspired by how humans learn from extensive feedback presented in the form of language. We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model, allowing us to take advantage of the language comprehension capabilities of language models. We condition the model on a sequence of model generations paired with feedback. By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors. Applying our method to large language models, we observed that Chain of Hindsight significantly surpasses previous methods in aligning language models with human preferences. We report significant improvements on summarization and dialogue benchmarks, with our approach markedly preferred in human evaluations.
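A minimal sketch of the data construction, assuming feedback is verbalized and paired generations are serialized into a single fine-tuning sequence; the template wording is illustrative, not the paper's exact format:

```python
# Sketch of chain-of-hindsight data construction: turn preference
# feedback into plain language so the model learns to condition on it.
def build_hindsight_example(prompt, bad_answer, good_answer):
    """Serialize a dispreferred and a preferred generation, annotated
    with hindsight feedback, into one training sequence."""
    return (
        f"{prompt}\n"
        f"A less helpful answer: {bad_answer}\n"
        f"A more helpful answer: {good_answer}"
    )

example = build_hindsight_example(
    "Summarize the article.",
    "It talks about stuff.",
    "The article reports a 20% rise in rooftop solar installations in 2022.",
)
```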
Language Reward Modulation for Pretraining Reinforcement Learning
Adeniji, Ademi, Xie, Amber, Sferrazza, Carmelo, Seo, Younggyo, James, Stephen, Abbeel, Pieter
Using learned reward functions (LRFs) as a means to solve sparse-reward reinforcement learning (RL) tasks has yielded steady progress in task complexity over the years. In this work, we question whether today's LRFs are best suited as a direct replacement for task rewards. Instead, we propose leveraging the capabilities of LRFs as a pretraining signal for RL. Concretely, we propose LAnguage Reward Modulated Pretraining (LAMP), which leverages the zero-shot capabilities of Vision-Language Models (VLMs) as a pretraining utility for RL as opposed to a downstream task reward. LAMP uses a frozen, pretrained VLM to scalably generate noisy, albeit shaped, exploration rewards by computing the contrastive alignment between a highly diverse collection of language instructions and the image observations of an agent in its pretraining environment. LAMP optimizes these rewards in conjunction with standard novelty-seeking exploration rewards with reinforcement learning to acquire a language-conditioned, pretrained policy. Our VLM pretraining approach, which is a departure from previous attempts to use LRFs, can warm-start sample-efficient learning on robot manipulation tasks in RLBench.
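A sketch of the shaped reward computation, assuming frozen CLIP-like image and text encoders; averaging over the instruction batch and any scaling are illustrative assumptions rather than the paper's exact recipe:

```python
# Sketch of a LAMP-style exploration reward: cosine alignment between
# the current observation embedding and a batch of instruction embeddings.
import numpy as np

def lamp_reward(image_emb, instruction_embs):
    """Mean cosine similarity between one image and many instructions,
    used as a noisy but shaped pretraining reward."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    instruction_embs = instruction_embs / np.linalg.norm(
        instruction_embs, axis=-1, keepdims=True)
    return float((instruction_embs @ image_emb).mean())

# Example with random stand-in embeddings (a real VLM would supply these).
image_emb = np.random.randn(512)
instruction_embs = np.random.randn(32, 512)   # diverse instruction batch
reward = lamp_reward(image_emb, instruction_embs)
```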
Zero-shot sim-to-real transfer of tactile control policies for aggressive swing-up manipulation
Bi, Thomas, Sferrazza, Carmelo, D'Andrea, Raffaello
This paper aims to show that robots equipped with a vision-based tactile sensor can perform dynamic manipulation tasks without prior knowledge of all the physical attributes of the objects to be manipulated. For this purpose, a robotic system is presented that is able to swing up poles of different masses, radii and lengths, to an angle of 180 degrees, while relying solely on the feedback provided by the tactile sensor. This is achieved by developing a novel simulator that accurately models the interaction of a pole with the soft sensor. A feedback policy that is conditioned on a sensory observation history, and which has no prior knowledge of the physical features of the pole, is then learned in the aforementioned simulation. When evaluated on the physical system, the policy is able to swing up a wide range of poles that differ significantly in their physical attributes without further adaptation. To the authors' knowledge, this is the first work where a feedback policy from high-dimensional tactile observations is used to control the swing-up manipulation of poles in closed-loop.
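A sketch of the history-conditioned policy interface, reflecting that the pole's physical parameters are inferred implicitly from a window of tactile observations; the wrapper and the sizes below are assumptions for illustration:

```python
# Sketch: the policy acts on a stacked window of recent tactile
# observations, never on explicit pole mass, radius, or length.
import collections
import numpy as np

class HistoryPolicy:
    def __init__(self, policy_fn, history_len=10, obs_dim=32):
        self.history = collections.deque(
            [np.zeros(obs_dim)] * history_len, maxlen=history_len)
        self.policy_fn = policy_fn              # trained in simulation

    def act(self, tactile_obs):
        self.history.append(tactile_obs)        # newest observation in window
        stacked = np.concatenate(self.history)  # observation history
        return self.policy_fn(stacked)

# Usage with a stand-in policy function.
policy = HistoryPolicy(lambda h: np.tanh(h[:1]), history_len=10, obs_dim=32)
command = policy.act(np.random.randn(32))
```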