OGBench: Benchmarking Offline Goal-Conditioned RL
Park, Seohong, Frans, Kevin, Eysenbach, Benjamin, Levine, Sergey
Offline goal-conditioned reinforcement learning (GCRL) is a major problem in reinforcement learning (RL) because it provides a simple, unsupervised, and domain-agnostic way to acquire diverse behaviors and representations from unlabeled data without rewards. Despite the importance of this setting, we lack a standard benchmark that can systematically evaluate the capabilities of offline GCRL algorithms. In this work, we propose OGBench, a new, high-quality benchmark for algorithms research in offline goal-conditioned RL. OGBench consists of 8 types of environments, 85 datasets, and reference implementations of 6 representative offline GCRL algorithms. We have designed these challenging and realistic environments and datasets to directly probe different capabilities of algorithms, such as stitching, long-horizon reasoning, and the ability to handle high-dimensional inputs and stochasticity. While representative algorithms may rank similarly on prior benchmarks, our experiments reveal stark strengths and weaknesses in these different capabilities, providing a strong foundation for building new algorithms. Project page: https://seohong.me/projects/ogbench
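As a concrete illustration of how the benchmark might be used, here is a minimal sketch of loading a task and inspecting its dataset. The `ogbench` package name, the `make_env_and_datasets` helper, the dataset naming scheme, and the dictionary layout follow the project page as best understood; treat them as assumptions rather than a verified API reference.

```python
# Minimal sketch of loading an OGBench task and inspecting its dataset (API assumed).
import ogbench

# One of the benchmark's locomotion-based goal-reaching tasks (name assumed).
env, train_dataset, val_dataset = ogbench.make_env_and_datasets('antmaze-large-navigate-v0')

# Datasets are assumed to be dictionaries of NumPy arrays keyed by field name.
print(train_dataset['observations'].shape, train_dataset['actions'].shape)

# A plain gymnasium-style rollout with a random policy as a placeholder.
obs, info = env.reset()
for _ in range(100):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    if terminated or truncated:
        obs, info = env.reset()
```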
Value-Based Deep RL Scales Predictably
Rybkin, Oleh, Nauman, Michal, Fu, Preston, Snell, Charlie, Abbeel, Pieter, Levine, Sergey, Kumar, Aviral
Scaling data and compute is critical to the success of machine learning. However, scaling demands predictability: we want methods to not only perform well with more compute or data, but also have their performance be predictable from small-scale runs, without running the large-scale experiment. In this paper, we show that value-based off-policy RL methods are predictable despite community lore regarding their pathological behavior. First, we show that data and compute requirements to attain a given performance level lie on a Pareto frontier, controlled by the updates-to-data (UTD) ratio. By estimating this frontier, we can predict this data requirement when given more compute, and this compute requirement when given more data. Second, we determine the optimal allocation of a total resource budget across data and compute for a given performance and use it to determine hyperparameters that maximize performance for a given budget. Third, this scaling behavior is enabled by first estimating predictable relationships between hyperparameters, which is used to manage effects of overfitting and plasticity loss unique to RL. We validate our approach using three algorithms: SAC, BRO, and PQL on DeepMind Control, OpenAI gym, and IsaacGym, when extrapolating to higher levels of data, compute, budget, or performance.
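To make the extrapolation idea concrete, the sketch below fits a data-compute frontier from small-scale runs and predicts the data requirement at a larger compute budget. The power-law-plus-offset functional form and the toy numbers are illustrative assumptions, not the paper's fitted model.

```python
# Sketch of extrapolating a data-compute Pareto frontier from small-scale runs.
import numpy as np
from scipy.optimize import curve_fit

# (compute, data) pairs that reached a fixed target return in small-scale runs.
# Compute is normalized to the smallest run to keep the fit well-conditioned.
compute = np.array([1.0, 2.0, 4.0, 8.0])            # relative gradient-step FLOPs
data = np.array([4.0e5, 2.6e5, 1.9e5, 1.5e5])       # environment steps (toy values)

def frontier(c, a, b, d_min):
    # Data requirement decays as a power law in compute, saturating at d_min.
    return d_min + a * c ** (-b)

params, _ = curve_fit(frontier, compute, data, p0=[3e5, 1.0, 1e5])

# Extrapolate the data requirement to a compute budget 4x beyond the largest run.
print(frontier(32.0, *params))
```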
Flow Q-Learning
Park, Seohong, Li, Qiyang, Levine, Sergey
Offline reinforcement learning (RL) enables training an effective decision-making policy from a previously collected dataset without costly environment interactions (Lange et al., 2012; Levine et al., 2020). The essence of offline RL is constrained optimization: the agent must maximize returns while staying within the dataset's state-action distribution (Levine et al., 2020). As datasets have grown larger and more diverse (Collaboration et al., 2024), their behavioral distributions have become more complex and multimodal, and this often necessitates an expressive policy class (Mandlekar et al., 2021) capable of capturing these complex distributions and implementing a more precise behavioral constraint. However, leveraging flow or diffusion models to parameterize policies for offline RL is not a trivial problem. Unlike with simpler policy classes, such as Gaussian policies, there is no straightforward way to train the flow or diffusion policies to maximize a learned value function, due to the iterative nature of these generative models. This is an example of a policy extraction problem, which is known to be a key challenge in offline RL in general (Park et al., 2024a). Previous works have devised diverse ways to extract an iterative generative policy from a learned value function, based on weighted regression, reparameterized policy gradient, rejection sampling, and other techniques. While they have ...
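For context on the extraction routes mentioned above, here is a minimal sketch of advantage-weighted regression, one of the weighted-regression techniques for extracting a policy from a learned value function. It is not the paper's flow-based method; `q_net`, `v_net`, and `policy` are assumed PyTorch modules, and `policy.log_prob` is an assumed interface.

```python
# Advantage-weighted regression sketch (one extraction route, not the FQL method).
import torch

def awr_policy_loss(policy, q_net, v_net, obs, actions, temperature=1.0):
    with torch.no_grad():
        advantage = q_net(obs, actions) - v_net(obs)
        # Exponentiated advantages weight the behavioral cloning term.
        weights = torch.clamp(torch.exp(advantage / temperature), max=100.0)
    log_prob = policy.log_prob(obs, actions)   # assumed policy interface
    return -(weights * log_prob).mean()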
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Chu, Tianzhe, Zhai, Yuexiang, Yang, Jihan, Tong, Shengbang, Xie, Saining, Schuurmans, Dale, Le, Quoc V., Levine, Sergey, Ma, Yi
Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize to out-of-distribution scenarios. Further analysis reveals that RL improves the model's underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL's superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains. These findings demonstrate the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks.
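To illustrate what an outcome-based reward can look like in an arithmetic card game of this kind, here is a hypothetical checker: it verifies that the model's proposed expression uses exactly the dealt cards and evaluates to the target. The target value and operator set are assumptions for illustration, not the GeneralPoints rules.

```python
# Hypothetical outcome-based reward for an arithmetic card game (rules assumed).
import ast
import re

def outcome_reward(expression: str, cards: list[int], target: int = 24) -> float:
    # Only allow digits, whitespace, and basic arithmetic operators.
    if not re.fullmatch(r"[\d\s\+\-\*/\(\)]+", expression):
        return 0.0
    used = sorted(int(tok) for tok in re.findall(r"\d+", expression))
    if used != sorted(cards):          # every card used exactly once
        return 0.0
    try:
        value = eval(compile(ast.parse(expression, mode="eval"), "<expr>", "eval"))
    except (SyntaxError, ZeroDivisionError):
        return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0

print(outcome_reward("(10 - 4) * (2 + 2)", [10, 4, 2, 2]))  # 1.0
```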
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Pertsch, Karl, Stachowicz, Kyle, Ichter, Brian, Driess, Danny, Nair, Suraj, Vuong, Quan, Mees, Oier, Finn, Chelsea, Levine, Sergey
Autoregressive sequence models, such as Transformer-based vision-language-action (VLA) policies, can be tremendously effective for capturing complex and generalizable robotic behaviors. However, such models require us to choose a tokenization of our continuous action signals, which determines how the discrete symbols predicted by the model map to continuous robot actions. We find that current approaches for robot action tokenization, based on simple per-dimension, per-timestep binning schemes, typically perform poorly when learning dexterous skills from high-frequency robot data. To address this challenge, we propose a new compression-based tokenization scheme for robot actions, based on the discrete cosine transform. Our tokenization approach, Frequency-space Action Sequence Tokenization (FAST), enables us to train autoregressive VLAs for highly dexterous and high-frequency tasks where standard discretization methods fail completely. Based on FAST, we release FAST+, a universal robot action tokenizer, trained on 1M real robot action trajectories. It can be used as a black-box tokenizer for a wide range of robot action sequences, with diverse action spaces and control frequencies. Finally, we show that, when combined with the pi0 VLA, our method can scale to training on 10k hours of robot data and match the performance of diffusion VLAs, while reducing training time by up to 5x.
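The sketch below illustrates the compression-based idea: apply a discrete cosine transform along the time axis of an action chunk, quantize the coefficients, and flatten them into discrete tokens. The quantization scale and the omission of FAST's byte-pair-encoding stage are simplifying assumptions.

```python
# Sketch of a DCT-based action tokenizer in the spirit of FAST (simplified).
import numpy as np
from scipy.fftpack import dct, idct

def tokenize(action_chunk: np.ndarray, scale: float = 10.0) -> np.ndarray:
    # action_chunk: (horizon, action_dim), e.g. a 1-second chunk at 50 Hz.
    coeffs = dct(action_chunk, axis=0, norm="ortho")
    return np.round(coeffs * scale).astype(np.int32).ravel()

def detokenize(tokens: np.ndarray, horizon: int, action_dim: int,
               scale: float = 10.0) -> np.ndarray:
    coeffs = tokens.reshape(horizon, action_dim).astype(np.float64) / scale
    return idct(coeffs, axis=0, norm="ortho")

chunk = np.random.uniform(-1, 1, size=(50, 7))        # 50 timesteps, 7-DoF arm
recovered = detokenize(tokenize(chunk), 50, 7)
print(np.max(np.abs(chunk - recovered)))              # small quantization error
```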
Reward-Guided Controlled Generation for Inference-Time Alignment in Diffusion Models: Tutorial and Review
Uehara, Masatoshi, Zhao, Yulai, Wang, Chenyu, Li, Xiner, Regev, Aviv, Levine, Sergey, Biancalani, Tommaso
This tutorial provides an in-depth guide on inference-time guidance and alignment methods for optimizing downstream reward functions in diffusion models. While diffusion models are renowned for their generative modeling capabilities, practical applications in fields such as biology often require sample generation that maximizes specific metrics (e.g., stability, affinity in proteins, closeness to target structures). In these scenarios, diffusion models can be adapted not only to generate realistic samples but also to explicitly maximize desired measures at inference time without fine-tuning. This tutorial explores the foundational aspects of such inference-time algorithms. We review these methods from a unified perspective, demonstrating that current techniques -- such as Sequential Monte Carlo (SMC)-based guidance, value-based sampling, and classifier guidance -- aim to approximate soft optimal denoising processes (a.k.a. policies in RL) that combine pre-trained denoising processes with value functions serving as look-ahead functions that predict from intermediate states to terminal rewards. Within this framework, we present several novel algorithms not yet covered in the literature. Furthermore, we discuss (1) fine-tuning methods combined with inference-time techniques, (2) inference-time algorithms based on search algorithms such as Monte Carlo tree search, which have received limited attention in current research, and (3) connections between inference-time algorithms in language models and diffusion models. The code of this tutorial on protein design is available at https://github.com/masa-ue/AlignInversePro
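As a concrete example of the SMC-style guidance described above, the sketch below propagates particles through a denoising process, reweights each particle by an exponentiated value estimate (the look-ahead function), and resamples. The `denoise_step` and `value_fn` callables are assumptions standing in for a pre-trained diffusion model and a learned value function.

```python
# Sketch of SMC-style inference-time guidance with value-weighted resampling.
import numpy as np

def smc_guided_sampling(denoise_step, value_fn, num_particles=64,
                        num_steps=50, dim=32, alpha=1.0, rng=None):
    rng = rng or np.random.default_rng(0)
    particles = rng.normal(size=(num_particles, dim))   # x_T ~ N(0, I)
    for t in reversed(range(num_steps)):
        # One step of the pre-trained denoising process.
        particles = denoise_step(particles, t)
        # Reweight by the value function's prediction of the terminal reward.
        log_w = alpha * value_fn(particles, t)
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        # Multinomial resampling keeps high-value particles.
        idx = rng.choice(num_particles, size=num_particles, p=w)
        particles = particles[idx]
    return particles
```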
Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding
Jones, Joshua, Mees, Oier, Sferrazza, Carmelo, Stachowicz, Kyle, Abbeel, Pieter, Levine, Sergey
Interacting with the world is a multi-sensory experience: achieving effective general-purpose interaction requires making use of all available modalities -- including vision, touch, and audio -- to fill in gaps from partial observation. For example, when vision is occluded while reaching into a bag, a robot should rely on its senses of touch and sound. However, state-of-the-art generalist robot policies are typically trained on large datasets to predict robot actions solely from visual and proprioceptive observations. In this work, we propose FuSe, a novel approach that enables finetuning visuomotor generalist policies on heterogeneous sensor modalities for which large datasets are not readily available by leveraging natural language as a common cross-modal grounding. We combine a multimodal contrastive loss with a sensory-grounded language generation loss to encode high-level semantics. In the context of robot manipulation, we show that FuSe enables performing challenging tasks that require reasoning jointly over modalities such as vision, touch, and sound in a zero-shot setting, such as multimodal prompting, compositional cross-modal prompting, and descriptions of objects it interacts with. We show that the same recipe is applicable to widely different generalist policies, including both diffusion-based generalist policies and large vision-language-action (VLA) models. Extensive experiments in the real world show that FuSe is able to increase success rates by over 20% compared to all considered baselines.
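A minimal sketch of the cross-modal grounding idea: a symmetric InfoNCE loss that pulls embeddings of heterogeneous sensor observations (e.g., touch or audio) toward the embedding of the language describing the same scene. The encoder outputs and the temperature are assumptions, and FuSe's sensory-grounded generation loss is omitted.

```python
# Symmetric contrastive loss between sensor and language embeddings (sketch).
import torch
import torch.nn.functional as F

def contrastive_loss(sensor_emb: torch.Tensor, lang_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # sensor_emb, lang_emb: (batch, dim) embeddings of matched (sensor, language) pairs.
    sensor_emb = F.normalize(sensor_emb, dim=-1)
    lang_emb = F.normalize(lang_emb, dim=-1)
    logits = sensor_emb @ lang_emb.T / temperature     # (batch, batch) similarities
    targets = torch.arange(logits.shape[0], device=logits.device)
    # Symmetric cross-entropy: sensor-to-language and language-to-sensor.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```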
Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents
Zhou, Yifei, Yang, Qianlan, Lin, Kaixiang, Bai, Min, Zhou, Xiong, Wang, Yu-Xiong, Levine, Sergey, Li, Erran
The vision of a broadly capable and goal-directed agent, such as an Internet-browsing agent in the digital world and a household humanoid in the physical world, has rapidly advanced, thanks to the generalization capability of foundation models. Such a generalist agent needs to have a large and diverse skill repertoire, such as finding directions between two travel locations and buying specific items from the Internet. If each skill needs to be specified manually through a fixed set of human-annotated instructions, the agent's skill repertoire will necessarily be limited due to the quantity and diversity of human-annotated instructions. In this work, we address this challenge by proposing Proposer-Agent-Evaluator (PAE), an effective learning system that enables foundation model agents to autonomously discover and practice skills in the wild. At the heart of PAE is a context-aware task proposer that autonomously proposes tasks for the agent to practice with context information of the environment, such as user demos or even just the name of the website itself for Internet-browsing agents. Then, the agent policy attempts those tasks with thoughts and actual grounded operations in the real world, with the resulting trajectories evaluated by an autonomous VLM-based success evaluator. The success evaluation serves as the reward signal for the agent to refine its policies through RL. We validate PAE on challenging vision-based web navigation, using both real-world and self-hosted websites from WebVoyager and WebArena. To the best of our knowledge, this work represents the first effective learning system to apply autonomous task proposal with RL for agents that generalize to real-world human-annotated benchmarks with SOTA performance. Our open-source checkpoints and code can be found at https://yanqval.github.io/PAE/
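The proposer-agent-evaluator loop described above can be summarized as the sketch below. The `proposer`, `agent`, `evaluator`, and `rl_update` components are assumed interfaces (e.g., VLM-backed callables), not the released implementation.

```python
# Sketch of one Proposer-Agent-Evaluator iteration (component interfaces assumed).
def pae_iteration(proposer, agent, evaluator, env, context, num_tasks=32):
    experience = []
    for _ in range(num_tasks):
        # 1. Context-aware task proposal (e.g., from the website name or user demos).
        task = proposer.propose(context)
        # 2. The agent attempts the task with grounded actions in the environment.
        trajectory = agent.rollout(env, task)
        # 3. A VLM-based evaluator scores the outcome; this is the reward signal.
        reward = evaluator.score(task, trajectory)
        experience.append((task, trajectory, reward))
    # 4. Refine the agent policy with RL on the autonomously gathered experience.
    agent.rl_update(experience)
    return experience
```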
RLDG: Robotic Generalist Policy Distillation via Reinforcement Learning
Xu, Charles, Li, Qiyang, Luo, Jianlan, Levine, Sergey
Recent advances in robotic foundation models have enabled the development of generalist policies that can adapt to diverse tasks. While these models show impressive flexibility, their performance heavily depends on the quality of their training data. In this work, we propose Reinforcement Learning Distilled Generalists (RLDG), a method that leverages reinforcement learning to generate high-quality training data for finetuning generalist policies. Through extensive real-world experiments on precise manipulation tasks like connector insertion and assembly, we demonstrate that generalist policies trained with RL-generated data consistently outperform those trained with human demonstrations, achieving up to 40% higher success rates while generalizing better to new tasks. We also provide a detailed analysis that reveals this performance gain stems from both optimized action distributions and improved state coverage. Our results suggest that combining task-specific RL with generalist policy distillation offers a promising approach for developing more capable and efficient robotic manipulation systems that maintain the flexibility of foundation models while achieving the performance of specialized controllers. Videos and code can be found on our project website https://generalist-distillation.github.io
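A minimal sketch of the distillation recipe described above: roll out a task-specific RL policy, keep the successful trajectories, and fine-tune the generalist policy on them with supervised (behavior-cloning) updates. The `rl_policy`, `generalist`, and `env` objects, and the `success` field in the step info, are assumptions rather than the released code.

```python
# Sketch of distilling a task-specific RL policy into a generalist (interfaces assumed).
def distill_rl_to_generalist(rl_policy, generalist, env, num_episodes=200):
    dataset = []
    for _ in range(num_episodes):
        obs, _ = env.reset()
        episode, done = [], False
        while not done:
            action = rl_policy.act(obs)
            next_obs, reward, terminated, truncated, info = env.step(action)
            episode.append((obs, action))
            obs, done = next_obs, terminated or truncated
        if info.get("success", False):       # keep only successful rollouts
            dataset.extend(episode)
    # Fine-tune the generalist with a standard imitation (behavior-cloning) loss.
    generalist.finetune_bc(dataset)
    return generalist
```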
Efficient Online Reinforcement Learning Fine-Tuning Need Not Retain Offline Data
Zhou, Zhiyuan, Peng, Andy, Li, Qiyang, Levine, Sergey, Kumar, Aviral
The predominant paradigm for learning at scale today involves pre-training models on diverse prior data, and then fine-tuning them on narrower domain-specific data to specialize them to particular downstream tasks [7, 4, 9, 37, 55, 50, 59]. In the context of learning decision-making policies, this paradigm translates to pre-training on a large amount of previously collected static experience via offline reinforcement learning (RL), followed by fine-tuning these initializations via online RL efficiently. Generally, this fine-tuning is done by continuing training with the very same offline RL algorithm, e.g., pessimistic [28, 6] algorithms or algorithms that apply behavioral constraints [14, 27], on a mixture of offline data and autonomous online data, with minor modifications to the offline RL algorithm itself [33]. While this paradigm has led to promising results [27, 33], RL fine-tuning requires continued training on offline data for stability and performance ([56, 57]; Section 3), as opposed to the standard practice in machine learning. Retaining offline data is problematic for several reasons. First, as offline datasets grow in size and diversity, continued online training on offline data becomes inefficient and expensive, and such computation requirements may even deter practitioners from using online RL for fine-tuning. Second, the need for retaining offline data perhaps defeats the point of offline RL pre-training altogether: recent results [47], corroborated by our experiments in Section 3, indicate that current fine-tuning approaches are not able to make good use of several strong offline RL value and/or policy initializations, as shown by the superior performance of running online RL from scratch with offline data put in the replay buffer [3]. These problems put the efficacy of current RL fine-tuning approaches into question. In this paper, we aim to understand and address the aforementioned shortcomings of current online finetuning methods and build an online RL approach that does not retain offline data.
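The contrast between the two fine-tuning regimes discussed above can be made concrete with the batch-sampling sketch below. The 50/50 mixing ratio and the buffer interface (`.sample(n, rng)` returning a list of transitions) are illustrative assumptions, not the paper's method.

```python
# Sketch contrasting fine-tuning with and without retaining offline data.
import numpy as np

def sample_batch_with_retention(offline_buffer, online_buffer, batch_size=256,
                                offline_fraction=0.5, rng=None):
    # Standard fine-tuning: each batch mixes offline and freshly collected data.
    rng = rng or np.random.default_rng(0)
    n_off = int(batch_size * offline_fraction)
    return (offline_buffer.sample(n_off, rng) +
            online_buffer.sample(batch_size - n_off, rng))

def sample_batch_without_retention(online_buffer, batch_size=256, rng=None):
    # The setting studied here: after offline pre-training, fine-tuning batches
    # come only from online experience; the offline dataset is discarded.
    rng = rng or np.random.default_rng(0)
    return online_buffer.sample(batch_size, rng)
```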