 perception and planning


Enhanced Robot Planning and Perception through Environment Prediction

arXiv.org Artificial Intelligence

Mobile robots rely on maps to navigate through an environment. In the absence of any map, the robots must build the map online from partial observations as they move in the environment. Traditional methods build a map using only direct observations. In contrast, humans identify patterns in the observed environment and make informed guesses about what to expect ahead. Modeling these patterns explicitly is difficult due to the complexity of the environments. However, these complex models can be approximated well using learning-based methods in conjunction with large amounts of training data. By extracting patterns, robots can use direct observations and predictions of what lies ahead to better navigate an unknown environment. In this dissertation, we present several learning-based methods to equip mobile robots with prediction capabilities for more efficient and safer operation. In the first part of the dissertation, we learn to predict using geometrical and structural patterns in the environment. Partially observed maps provide invaluable cues for accurately predicting the unobserved areas. We first demonstrate the capability of general learning-based approaches to model these patterns for a variety of overhead map modalities. Then we employ task-specific learning for faster navigation in indoor environments by predicting 2D occupancy in the nearby regions. This idea is further extended to 3D point cloud representation for object reconstruction. By predicting the shape of the full object from only partial views, our approach paves the way for efficient next-best-view planning. In the second part of the dissertation, we learn to predict using spatiotemporal patterns in the environment. We focus on dynamic tasks such as target tracking and coverage where we seek decentralized coordination between robots. We first show how graph neural networks can be used for more scalable and faster inference.


Perception Helps Planning: Facilitating Multi-Stage Lane-Level Integration via Double-Edge Structures

arXiv.org Artificial Intelligence

When planning for autonomous driving, it is crucial to consider essential traffic elements such as lanes, intersections, traffic regulations, and dynamic agents. However, they are often overlooked by traditional end-to-end planning methods, likely leading to inefficiencies and non-compliance with traffic regulations. In this work, we endeavor to integrate the perception of these elements into the planning task. To this end, we propose Perception Helps Planning (PHP), a novel framework that reconciles lane-level planning with perception. This integration ensures that planning is inherently aligned with traffic constraints, thus facilitating safe and efficient driving. Specifically, PHP focuses on both edges of a lane for planning and perception purposes, taking into consideration the 3D positions of both lane edges and attributes for lane intersections, lane directions, lane occupancy, and planning. In the algorithmic design, the process begins with a transformer encoding multi-camera images to extract the above features and predict lane-level perception results. Next, the hierarchical feature early fusion module refines the features for predicting planning attributes. Finally, the double-edge interpreter utilizes a late-fusion process specifically designed to integrate lane-level perception and planning information, culminating in the generation of vehicle control signals. Experiments on three CARLA benchmarks show significant improvements in driving score of 27.20%, 33.47%, and 15.54% over existing algorithms, respectively, achieving state-of-the-art performance, with the system operating at up to 22.57 FPS.


ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning

arXiv.org Artificial Intelligence

For robots to perform a wide variety of tasks, they require a 3D representation of the world that is semantically rich, yet compact and efficient for task-driven perception and planning. Recent approaches have attempted to leverage features from large vision-language models to encode semantics in 3D representations. However, these approaches tend to produce maps with per-point feature vectors, which do not scale well in larger environments, nor do they contain semantic spatial relationships between entities in the environment, which are useful for downstream planning. In this work, we propose ConceptGraphs, an open-vocabulary graph-structured representation for 3D scenes. ConceptGraphs is built by leveraging 2D foundation models and fusing their output into 3D by multi-view association. The resulting representations generalize to novel semantic classes, without the need to collect large 3D datasets or finetune models. We demonstrate the utility of this representation through a number of downstream planning tasks that are specified through abstract (language) prompts and require complex reasoning over spatial and semantic concepts. (Project page: https://concept-graphs.github.io/; explainer video: https://youtu.be/mRhNkQwRYnc)
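To make the idea of a graph-structured scene representation concrete, here is a minimal sketch of objects as nodes and spatial relationships as labelled edges. The class names, fields, and the `related` query are hypothetical and chosen for illustration only; they are not taken from the ConceptGraphs code base, and the sketch does not cover how nodes are built from 2D foundation models.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ObjectNode:
    node_id: int
    label: str                            # free-text (open-vocabulary) label
    centroid: Tuple[float, float, float]  # 3D position in the map frame


@dataclass
class SceneGraph:
    # Hypothetical container: nodes keyed by id, edges as (subject, relation, object).
    nodes: Dict[int, ObjectNode] = field(default_factory=dict)
    edges: List[Tuple[int, str, int]] = field(default_factory=list)

    def add_object(self, node: ObjectNode) -> None:
        self.nodes[node.node_id] = node

    def add_relation(self, subject_id: int, relation: str, object_id: int) -> None:
        # Both endpoints must already exist in the graph.
        if subject_id not in self.nodes or object_id not in self.nodes:
            raise KeyError("both endpoints must be added before relating them")
        self.edges.append((subject_id, relation, object_id))

    def related(self, relation: str, object_label: str) -> List[str]:
        """Labels of nodes that stand in `relation` to a node labelled `object_label`."""
        return [
            self.nodes[s].label
            for s, r, o in self.edges
            if r == relation and self.nodes[o].label == object_label
        ]


graph = SceneGraph()
graph.add_object(ObjectNode(0, "mug", (1.0, 0.5, 0.8)))
graph.add_object(ObjectNode(1, "table", (1.0, 0.5, 0.4)))
graph.add_relation(0, "on", 1)
print(graph.related("on", "table"))
```

A per-object graph like this stores one entry per entity rather than one feature vector per point, which is the compactness argument the abstract makes.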


Communication-Critical Planning via Multi-Agent Trajectory Exchange

arXiv.org Artificial Intelligence

This paper addresses the task of joint multi-agent perception and planning, especially as it relates to the real-world challenge of collision-free navigation for connected self-driving vehicles. For this task, several communication-enabled vehicles must navigate through a busy intersection while avoiding collisions with each other and with obstacles. To this end, this paper proposes a learnable costmap-based planning mechanism, given raw perceptual data, that is (1) distributed, (2) uncertainty-aware, and (3) bandwidth-efficient. Our method produces a costmap and an uncertainty-aware entropy map to sort and fuse candidate trajectories as evaluated across multiple agents. The proposed method demonstrates several favorable performance trends on a suite of open-source overhead datasets as well as within a novel communication-critical simulator. It produces accurate semantic occupancy forecasts as an intermediate perception output, attaining a 72.5% average pixel-wise classification accuracy. By selecting the top trajectory, the multi-agent method scales well with the number of agents, reducing the hard collision rate by up to 57% with eight agents compared to the single-agent version.
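As a rough illustration of the sort-and-select step this abstract describes, the sketch below scores candidate trajectories against each agent's costmap and entropy map, adds the scores across agents, and ranks the candidates. The scoring rule (summed cost plus a weighted entropy penalty), the `entropy_weight` parameter, and all function names are assumptions made for illustration, not the paper's actual formulation.

```python
from typing import List, Sequence, Tuple

Grid = Sequence[Sequence[float]]
Trajectory = List[Tuple[int, int]]  # sequence of (row, col) grid cells


def trajectory_score(traj: Trajectory, costmap: Grid, entropy_map: Grid,
                     entropy_weight: float = 0.5) -> float:
    # Assumed rule: accumulate cost plus an uncertainty penalty along the path.
    return sum(costmap[r][c] + entropy_weight * entropy_map[r][c] for r, c in traj)


def rank_trajectories(candidates: List[Trajectory],
                      agent_maps: List[Tuple[Grid, Grid]],
                      entropy_weight: float = 0.5) -> List[int]:
    """Return candidate indices sorted from lowest to highest fused score."""
    fused = [
        sum(trajectory_score(traj, cm, em, entropy_weight) for cm, em in agent_maps)
        for traj in candidates
    ]
    return sorted(range(len(candidates)), key=lambda i: fused[i])


# Two candidates evaluated by two agents on a 2x2 grid (toy values).
candidates = [[(0, 0), (0, 1)], [(0, 0), (1, 0)]]
agent_a = ([[0.0, 1.0], [0.2, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
agent_b = ([[0.0, 0.5], [0.1, 0.0]], [[0.0, 1.0], [0.0, 0.0]])
order = rank_trajectories(candidates, [agent_a, agent_b])
best = candidates[order[0]]
```

In this sketch each agent would only need to share one scalar per candidate rather than its full maps, which is one plausible way to keep the exchange bandwidth-efficient; whether the paper does this is not stated in the abstract.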


Robots Getting a Grip on General Manipulation

IEEE Spectrum Robotics

This is a guest post. The views expressed here are solely those of the author and do not represent positions of IEEE Spectrum or the IEEE. While robots have prepared entire breakfasts since 1961, general manipulation in the real world is arguably an even more complex problem than autonomous driving. It is difficult to pinpoint exactly why, though. Closely watching the 1961 video suggests that a two-finger parallel gripper is good enough for a variety of tasks, and that it is only perception and encoded common sense that prevent a robot from performing such feats in the real world.