
Collaborating Authors

 Agarwal, Tanmay


Modeling Dynamic Environments with Scene Graph Memory

arXiv.org Artificial Intelligence

Embodied AI agents that search for objects in large environments such as households often need to make efficient decisions by predicting object locations based on partial information. We pose this as a new type of link prediction problem: link prediction on partially observable dynamic graphs. Our graph is a representation of a scene in which rooms and objects are nodes, and their relationships are encoded in the edges; only parts of the changing graph are known to the agent at each timestep. This partial observability poses a challenge to existing link prediction approaches, which we address. We propose a novel state representation - Scene Graph Memory (SGM) - that captures the agent's accumulated set of observations, as well as a neural net architecture called a Node Edge Predictor (NEP) that extracts information from the SGM to search efficiently.

We investigate a novel instance of this problem: temporal link prediction with partial observability, i.e. when the past observations of the graph contain only parts of it. This setting maps naturally to a common problem in embodied AI: using past sensor observations to predict the state of a dynamic environment represented by a graph. Graphs are used frequently as the state representation of large scenes in the form of scene graphs (Johnson et al., 2015; Armeni et al., 2019; Ravichandran et al., 2022a; Hughes et al., 2022), a relational object-centric representation where nodes are objects or rooms, and edges encode relationships such as inside or onTop. Link prediction could be applied to partially observed, dynamic scene graphs to infer relationships between pairs of objects, enabling various downstream decision-making tasks for which scene graphs have been shown to be useful, such as navigation (Amiri et al., 2022; Santos & Romero, 2022), manipulation (Agia et al., 2022; Zhu et al., 2021) and object search (Ravichandran et al., 2022a; Xu et al., 2022).
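The representation lends itself to a simple illustration. The following is a minimal sketch, assuming hypothetical class and method names (SceneGraphMemory, observe, candidate_score), of how partial observations of a changing scene graph could be accumulated and queried; a learned Node Edge Predictor would replace the toy recency heuristic with a neural scoring function.

```python
# Minimal sketch of the partially observable scene-graph setting described above.
# Class, method, and relation names here are illustrative assumptions, not the
# paper's actual implementation.
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class SceneGraphMemory:
    """Accumulates partial observations of a changing scene graph."""
    # (src, relation, dst) -> last timestep at which the edge was observed
    last_seen: dict = field(default_factory=dict)
    # node -> how often it has appeared in any observation
    node_counts: dict = field(default_factory=lambda: defaultdict(int))

    def observe(self, t, edges):
        """Merge one timestep's partial observation, e.g. ("apple", "inside", "fridge")."""
        for src, rel, dst in edges:
            self.last_seen[(src, rel, dst)] = t
            self.node_counts[src] += 1
            self.node_counts[dst] += 1

    def candidate_score(self, t, src, rel, dst):
        """Toy recency-based score for a queried link; a trained Node Edge
        Predictor would replace this heuristic with a neural scoring function."""
        if (src, rel, dst) not in self.last_seen:
            return 0.0
        age = t - self.last_seen[(src, rel, dst)]
        return 1.0 / (1.0 + age)

# Usage: feed in partial observations over time, then query a candidate link.
sgm = SceneGraphMemory()
sgm.observe(0, [("apple", "inside", "fridge"), ("fridge", "inside", "kitchen")])
sgm.observe(3, [("mug", "onTop", "table")])
print(sgm.candidate_score(5, "apple", "inside", "fridge"))  # decays with time since last seen
```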


The ObjectFolder Benchmark: Multisensory Learning with Neural and Real Objects

arXiv.org Artificial Intelligence

We introduce the ObjectFolder Benchmark, a benchmark suite of 10 tasks for multisensory object-centric learning, centered around object recognition, reconstruction, and manipulation with sight, sound, and touch. We also introduce the ObjectFolder Real dataset, including the multisensory measurements for 100 real-world household objects, building upon a newly designed pipeline for collecting the 3D meshes, videos, impact sounds, and tactile readings of real-world objects. We conduct systematic benchmarking on both the 1,000 multisensory neural objects from ObjectFolder, and the real multisensory data from ObjectFolder Real. Our results demonstrate the importance of multisensory perception and reveal the respective roles of vision, audio, and touch for different object-centric learning tasks. By publicly releasing our dataset and benchmark suite, we hope to catalyze and enable new research in multisensory object-centric learning in computer vision, robotics, and beyond. Project page: https://objectfolder.stanford.edu
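As an illustration of the kind of data the benchmark pairs together, the sketch below organizes one object's mesh, video, impact-sound, and tactile files into a single record. The field names and file paths are assumptions for illustration only and do not reproduce the actual ObjectFolder API.

```python
# Illustrative layout of one multisensory object entry (sight, sound, touch).
# All names below are hypothetical; consult the project page for the real format.
from dataclasses import dataclass
from typing import List

@dataclass
class MultisensoryObject:
    object_id: str
    mesh_path: str                  # 3D mesh of the object
    video_paths: List[str]          # RGB video views
    impact_sound_paths: List[str]   # impact-sound recordings
    tactile_paths: List[str]        # tactile readings

def modalities_available(obj: MultisensoryObject):
    """Report which of vision / audio / touch are present for this object."""
    return {
        "vision": bool(obj.video_paths),
        "audio": bool(obj.impact_sound_paths),
        "touch": bool(obj.tactile_paths),
    }

obj = MultisensoryObject(
    object_id="mug_042",
    mesh_path="meshes/mug_042.obj",
    video_paths=["video/mug_042_view0.mp4"],
    impact_sound_paths=["audio/mug_042_strike0.wav"],
    tactile_paths=["touch/mug_042_press0.npy"],
)
print(modalities_available(obj))
```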


Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting

arXiv.org Artificial Intelligence

We introduce Argoverse 2 (AV2) - a collection of three datasets for perception and forecasting research in the self-driving domain. The annotated Sensor Dataset contains 1,000 sequences of multimodal data, encompassing high-resolution imagery from seven ring cameras and two stereo cameras, in addition to lidar point clouds and 6-DOF map-aligned pose. Sequences contain 3D cuboid annotations for 26 object categories, all of which are sufficiently sampled to support training and evaluation of 3D perception models. The Lidar Dataset contains 20,000 sequences of unlabeled lidar point clouds and map-aligned pose. This dataset is the largest collection of lidar sensor data to date and supports self-supervised learning and the emerging task of point cloud forecasting. Finally, the Motion Forecasting Dataset contains 250,000 scenarios mined for interesting and challenging interactions between the autonomous vehicle and other actors in each local scene. Models are tasked with predicting the future motion of "scored actors" in each scenario and are provided with track histories that capture object location, heading, velocity, and category. In all three datasets, each scenario contains its own HD map with 3D lane and crosswalk geometry, sourced from data captured in six distinct cities. We believe these datasets will support new and existing machine learning research problems in ways that existing datasets do not. All datasets are released under the CC BY-NC-SA 4.0 license.
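To make the forecasting setup concrete, the sketch below models a scored actor's track history (position, heading, velocity, category) and extrapolates it with a trivial constant-velocity baseline. The class and field names are assumptions and do not mirror the official Argoverse 2 API.

```python
# Illustrative data layout for a motion-forecasting scenario actor, plus a
# constant-velocity baseline. Names are hypothetical, not the official API.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TrackState:
    position: Tuple[float, float]   # x, y in the scenario's local map frame
    heading: float                  # yaw in radians
    velocity: Tuple[float, float]   # vx, vy

@dataclass
class ActorTrack:
    actor_id: str
    category: str                   # e.g. "vehicle", "pedestrian", "cyclist"
    is_scored: bool                 # models are evaluated on scored actors only
    history: List[TrackState]       # observed past states

def constant_velocity_forecast(track: ActorTrack, horizon: int, dt: float = 0.1):
    """Trivial baseline: extrapolate the last observed state at constant velocity.
    A learned forecaster would condition on the HD map and surrounding actors."""
    last = track.history[-1]
    x, y = last.position
    vx, vy = last.velocity
    return [(x + vx * dt * k, y + vy * dt * k) for k in range(1, horizon + 1)]

track = ActorTrack(
    actor_id="a1",
    category="vehicle",
    is_scored=True,
    history=[TrackState((0.0, 0.0), 0.0, (5.0, 0.0))],
)
print(constant_velocity_forecast(track, horizon=3))
```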


Affordance-based Reinforcement Learning for Urban Driving

arXiv.org Artificial Intelligence

Traditional autonomous vehicle pipelines that follow a modular approach have been very successful in the past, both in academia and industry, and have led to autonomy being deployed on the road. Though this approach provides ease of interpretation, its generalizability to unseen environments is limited, and hand-engineering of numerous parameters is required, especially in the prediction and planning systems. Recently, deep reinforcement learning has been shown to learn complex strategic games and perform challenging robotic tasks, which makes it an appealing framework for learning to drive. In this work, we propose a deep reinforcement learning framework that learns an optimal control policy from waypoints and low-dimensional visual representations, also known as affordances. We demonstrate that our agents, when trained from scratch, learn lane-following, driving around intersections, and stopping in front of other actors or traffic lights, even in dense traffic. Our method achieves comparable or better performance than the baseline methods on the original and NoCrash benchmarks in the CARLA simulator.
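As a rough illustration of the interface such a policy could expose, the sketch below maps a handful of affordances (lane offset, heading error, distance to the lead actor, traffic-light state) plus upcoming waypoints to steering and throttle/brake outputs. The feature choices and network shape are assumptions, not the paper's exact design.

```python
# Illustrative affordance-based driving policy: low-dimensional affordances and
# ego-frame waypoints in, continuous controls out. Shapes and features are
# assumptions for this sketch.
import torch
import torch.nn as nn

class AffordancePolicy(nn.Module):
    def __init__(self, num_waypoints: int = 5, num_affordances: int = 4):
        super().__init__()
        # waypoints are (x, y) pairs in the ego frame; affordances are scalars
        in_dim = 2 * num_waypoints + num_affordances
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 2),  # steering and throttle/brake
        )

    def forward(self, waypoints: torch.Tensor, affordances: torch.Tensor):
        x = torch.cat([waypoints.flatten(1), affordances], dim=1)
        # squash to [-1, 1]: column 0 = steering, column 1 = throttle (+) / brake (-)
        return torch.tanh(self.net(x))

# Usage with a dummy batch; in practice the policy would be trained with an
# RL algorithm against the driving reward in the simulator.
policy = AffordancePolicy()
waypoints = torch.zeros(1, 5, 2)                        # five upcoming ego-frame waypoints
affordances = torch.tensor([[0.1, 0.0, 20.0, 1.0]])     # lane offset, heading error, gap, light state
print(policy(waypoints, affordances))
```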