 Jalobeanu, Mihai


The Sandbox Environment for Generalizable Agent Research (SEGAR)

arXiv.org Artificial Intelligence

A broad challenge of research on generalization for sequential decision-making tasks in interactive environments is designing benchmarks that clearly landmark progress. While there has been notable headway, current benchmarks either do not provide suitable exposure to or intuitive control of the underlying factors, are not easy to implement, customize, or extend, or are computationally expensive to run. We built the Sandbox Environment for Generalizable Agent Research (SEGAR) with all of these things in mind. SEGAR improves the ease and accountability of generalization research in RL: generalization objectives can be easily designed by specifying task distributions, which in turn allows the researcher to measure the nature of the generalization objective. We present an overview of SEGAR and how it contributes to these goals, as well as experiments that demonstrate a few types of research questions SEGAR can help answer.
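The abstract's central idea, specifying a generalization objective as a pair of task distributions over underlying factors, can be sketched as follows. The factor names, ranges, and sampling scheme here are illustrative assumptions, not SEGAR's actual API:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical underlying factors of a task; SEGAR's real factor set differs.
def sample_task(mass_range, friction_range):
    """Sample one task by drawing each underlying factor from its range."""
    return {
        "mass": rng.uniform(*mass_range),
        "friction": rng.uniform(*friction_range),
    }

# A generalization objective: train and test distributions differ only in mass,
# so measured performance gaps can be attributed to that single factor shift.
train_tasks = [sample_task((0.5, 1.0), (0.1, 0.5)) for _ in range(100)]
test_tasks = [sample_task((1.0, 2.0), (0.1, 0.5)) for _ in range(100)]
```

Because the two distributions are written down explicitly, the distance between them (and hence the difficulty of the generalization objective) is something the researcher can quantify rather than guess at.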


PLEX: Making the Most of the Available Data for Robotic Manipulation Pretraining

arXiv.org Artificial Intelligence

Transformers [1] have led to breakthroughs in training large-scale general representations for computer vision (CV) and natural language processing (NLP) [2], enabling zero-shot adaptation and fast finetuning [3]. At the same time, despite impressive progress, transformer-based representations haven't shown the same versatility for robotic manipulation. Some attribute this gap to the lack of suitable training data for robotics [3]. We argue instead that data relevant to training robotic manipulation models is copious but has important structure that most existing training methods ignore and fail to leverage. These insights lead us to propose a novel transformer-based architecture, called PLEX, that is capable of effective learning from realistically available robotic manipulation datasets. We observe that robotics-relevant data falls into three major categories: (1) Video-only data, which contain high-quality and potentially description-annotated demonstrations for an immense variety of tasks but have no explicit action information for a robot to mimic; (2) Data containing matching sequences of percepts and actions, which are less plentiful than pure videos and don't necessarily correspond to meaningful tasks [4], but capture valuable correlations between a robot's actions and changes in the environment and are easy to collect on a given robot; (3) Small sets of high-quality sensorimotor demonstrations for a target task in a target environment. Thus, a scalable model architecture for robotic manipulation must be able to learn primarily from videos, while being especially data-efficient on sensorimotor training sequences and on the small number of target demonstrations. PLEX, the PLanning-EXecution architecture we propose, is designed to take advantage of data sources of these types.
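One way to picture the planning-execution split and how it matches the three data categories: a planner predicts future observation latents (learnable from video-only data, category 1), and an executor turns planned latents into actions (which requires action-labeled data, categories 2 and 3). The sketch below uses hypothetical linear stand-ins for both modules; the names and dimensions are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

OBS_DIM, LATENT_DIM, ACT_DIM = 16, 8, 4  # hypothetical sizes

class Planner:
    """Predicts the next latent observation from the current one.
    Needs only observation sequences, so video-only data suffices."""
    def __init__(self):
        self.W = rng.normal(scale=0.1, size=(LATENT_DIM, LATENT_DIM))
    def __call__(self, z):
        return np.tanh(self.W @ z)

class Executor:
    """Maps (current latent, planned latent) to an action.
    Needs paired percept-action data to train."""
    def __init__(self):
        self.W = rng.normal(scale=0.1, size=(ACT_DIM, 2 * LATENT_DIM))
    def __call__(self, z, z_target):
        return self.W @ np.concatenate([z, z_target])

def encode(obs):
    """Stand-in observation encoder (a real system would use a learned one)."""
    return obs[:LATENT_DIM]

obs = rng.normal(size=OBS_DIM)
z = encode(obs)
planner, executor = Planner(), Executor()
z_next = planner(z)           # planned next latent state
action = executor(z, z_next)  # action intended to realize the plan
```

The design point the sketch illustrates is that the two modules have different data appetites, so the abundant category of data trains the larger planning problem while scarce sensorimotor data only has to train the executor.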


Goal Representations for Instruction Following: A Semi-Supervised Language Interface to Control

arXiv.org Artificial Intelligence

Our goal is for robots to follow natural language instructions like "put the towel next to the microwave." But getting large amounts of labeled data, i.e., data that contains demonstrations of tasks labeled with the language instruction, is prohibitively expensive. In contrast, obtaining policies that respond to image goals is much easier, because any autonomous trial or demonstration can be labeled in hindsight with its final state as the goal. In this work, we contribute a method that taps into joint image- and goal-conditioned policies with language using only a small amount of language data. Prior work has made progress on this using vision-language models or by jointly training language-goal-conditioned policies, but so far neither method has scaled effectively to real-world robot tasks without significant human annotation. Our method achieves robust performance in the real world by learning an embedding from the labeled data that aligns language not to the goal image, but rather to the desired change between the start and goal images that the instruction corresponds to. We then train a policy on this embedding: the policy benefits from all the unlabeled data, but the aligned embedding provides an interface for language to steer the policy. We show instruction following across a variety of manipulation tasks in different scenes, with generalization to language instructions outside of the labeled data. Videos and code for our approach can be found on our website: https://rail-berkeley.github.io/grif/ .
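The alignment step described above, matching a language embedding to the embedding of the change between start and goal images rather than to the goal image alone, resembles a contrastive objective. Below is a minimal sketch assuming hypothetical linear encoders and an InfoNCE-style loss; it is not the authors' exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_change(start_img, goal_img, W):
    """Hypothetical linear encoder of the change between start and goal images."""
    return W @ (goal_img - start_img)

def align_loss(lang_emb, change_emb, temperature=0.1):
    """InfoNCE-style loss: each instruction should match its own image change."""
    lang = lang_emb / np.linalg.norm(lang_emb, axis=1, keepdims=True)
    chg = change_emb / np.linalg.norm(change_emb, axis=1, keepdims=True)
    logits = lang @ chg.T / temperature                # (B, B) similarity matrix
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()                  # matching pairs on the diagonal

B, D_LANG, D_IMG = 4, 8, 16                            # illustrative sizes
W = rng.normal(scale=0.1, size=(D_LANG, D_IMG))
starts = rng.normal(size=(B, D_IMG))
goals = rng.normal(size=(B, D_IMG))
lang = rng.normal(size=(B, D_LANG))
changes = np.stack([embed_change(s, g, W) for s, g in zip(starts, goals)])
loss = align_loss(lang, changes)
```

Minimizing a loss of this shape pulls each instruction's embedding toward its own start-to-goal change and away from the other changes in the batch, which is what lets a small labeled set steer a policy trained mostly on unlabeled goal-image data.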


Platform for Situated Intelligence

arXiv.org Artificial Intelligence

We introduce Platform for Situated Intelligence, an open-source framework created to support the rapid development and study of multimodal, integrative-AI systems. The framework provides infrastructure for sensing, fusing, and making inferences from temporal streams of data across different modalities, a set of tools that enable visualization and debugging, and an ecosystem of components that encapsulate a variety of perception and processing technologies. These assets jointly provide the means for rapidly constructing and refining multimodal, integrative-AI systems, while retaining the efficiency and performance characteristics required for deployment in open-world settings.
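A core operation behind "fusing temporal streams across modalities" is pairing messages from differently-clocked streams. The generic sketch below joins two timestamped streams by nearest timestamp within a tolerance; it is an illustration in Python, not the framework's API (the platform itself is a .NET library with its own stream operators):

```python
import bisect

def fuse_nearest(stream_a, stream_b, tolerance):
    """Pair each (timestamp, value) message in stream_a with the
    nearest-in-time message in stream_b, if one lies within `tolerance`.
    Both streams must be sorted by timestamp."""
    times_b = [t for t, _ in stream_b]
    fused = []
    for t, va in stream_a:
        i = bisect.bisect_left(times_b, t)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(stream_b)]
        if not candidates:
            continue
        j = min(candidates, key=lambda j: abs(times_b[j] - t))
        if abs(times_b[j] - t) <= tolerance:
            fused.append((t, va, stream_b[j][1]))
    return fused

# e.g. pairing audio frames with the nearest video frame within 0.2 s
audio = [(0.0, "a0"), (0.5, "a1"), (1.0, "a2")]
video = [(0.1, "v0"), (0.9, "v1")]
fused = fuse_nearest(audio, video, tolerance=0.2)
```

Messages with no sufficiently close partner (here the audio frame at 0.5 s) are dropped, which is one common policy for multimodal synchronization; others interpolate or repeat the last value.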