Blukis, Valts
RVT: Robotic View Transformer for 3D Object Manipulation
Goyal, Ankit, Xu, Jie, Guo, Yijie, Blukis, Valts, Chao, Yu-Wei, Fox, Dieter
For 3D object manipulation, methods that build an explicit 3D representation perform better than those relying only on camera images. But using explicit 3D representations like voxels comes at a large computational cost, adversely affecting scalability. In this work, we propose RVT, a multi-view transformer for 3D manipulation that is both scalable and accurate. Key features of RVT are an attention mechanism to aggregate information across views and re-rendering of the camera input from virtual views around the robot workspace. In simulations, we find that a single RVT model works well across 18 RLBench tasks with 249 task variations, achieving 26% higher relative success than the existing state-of-the-art method (PerAct). It also trains 36X faster than PerAct to reach the same performance and achieves 2.3X the inference speed of PerAct. Further, RVT can perform a variety of manipulation tasks in the real world with just a few ($\sim$10) demonstrations per task. Visual results, code, and the trained model are provided at https://robotic-view-transformer.github.io/.
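To make the cross-view aggregation idea concrete, here is a minimal sketch (not the released RVT code) of attention over feature tokens from several re-rendered virtual views; the module names, token shapes, and the maximum number of views are illustrative assumptions.

```python
# Illustrative sketch of attention-based aggregation across re-rendered
# virtual views. Assumes each virtual view has already been encoded into a
# sequence of feature tokens; shapes and hyperparameters are placeholders.
import torch
import torch.nn as nn


class MultiViewAggregator(nn.Module):
    """Fuse per-view tokens with self-attention over all views jointly."""

    def __init__(self, dim: int = 256, heads: int = 8, layers: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.view_embed = nn.Parameter(torch.zeros(1, 8, 1, dim))  # up to 8 views

    def forward(self, view_tokens: torch.Tensor) -> torch.Tensor:
        # view_tokens: (batch, num_views, tokens_per_view, dim)
        b, v, t, d = view_tokens.shape
        x = view_tokens + self.view_embed[:, :v]  # tag tokens with their view
        x = x.reshape(b, v * t, d)                # attend across all views at once
        return self.encoder(x)                    # (batch, v * t, dim)


if __name__ == "__main__":
    tokens = torch.randn(2, 5, 64, 256)  # e.g. 5 virtual views, 8x8 patches each
    print(MultiViewAggregator()(tokens).shape)  # torch.Size([2, 320, 256])
```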
BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects
Wen, Bowen, Tremblay, Jonathan, Blukis, Valts, Tyree, Stephen, Müller, Thomas, Evans, Alex, Fox, Dieter, Kautz, Jan, Birchfield, Stan
We present a near real-time method for 6-DoF tracking of an unknown object from a monocular RGBD video sequence, while simultaneously performing neural 3D reconstruction of the object. Our method works for arbitrary rigid objects, even when visual texture is largely absent. The object is assumed to be segmented in the first frame only. No additional information is required, and no assumption is made about the interaction agent. Key to our method is a Neural Object Field that is learned concurrently with a pose graph optimization process in order to robustly accumulate information into a consistent 3D representation capturing both geometry and appearance. A dynamic pool of posed memory frames is automatically maintained to facilitate communication between these concurrent processes. Our approach handles challenging sequences with large pose changes, partial and full occlusion, untextured surfaces, and specular highlights. We show results on the HO3D, YCBInEOAT, and BEHAVE datasets, demonstrating that our method significantly outperforms existing approaches. Project page: https://bundlesdf.github.io
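As a rough illustration of the "dynamic pool of posed memory frames" mentioned above (not the BundleSDF implementation), the sketch below keeps a new frame only when its estimated pose adds a sufficiently novel viewpoint; the rotation threshold, pool size, and novelty criterion are assumptions.

```python
# Hypothetical memory-frame pool: a frame is stored only if its pose differs
# enough from the poses already kept. Thresholds and the pose metric are
# illustrative, not values from the paper.
import numpy as np


class MemoryFramePool:
    def __init__(self, min_rot_deg: float = 10.0, max_frames: int = 100):
        self.min_rot_deg = min_rot_deg
        self.max_frames = max_frames
        self.frames = []  # list of (rgbd, 4x4 pose)

    @staticmethod
    def _rot_angle_deg(pose_a: np.ndarray, pose_b: np.ndarray) -> float:
        """Geodesic angle between the rotation parts of two 4x4 poses."""
        r = pose_a[:3, :3].T @ pose_b[:3, :3]
        cos = np.clip((np.trace(r) - 1.0) / 2.0, -1.0, 1.0)
        return float(np.degrees(np.arccos(cos)))

    def maybe_add(self, rgbd: np.ndarray, pose: np.ndarray) -> bool:
        """Keep the frame if it contributes a sufficiently novel viewpoint."""
        novel = all(
            self._rot_angle_deg(pose, p) >= self.min_rot_deg for _, p in self.frames
        )
        if novel and len(self.frames) < self.max_frames:
            self.frames.append((rgbd, pose.copy()))
        return novel
```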
A Persistent Spatial Semantic Representation for High-level Natural Language Instruction Execution
Blukis, Valts, Paxton, Chris, Fox, Dieter, Garg, Animesh, Artzi, Yoav
Natural language provides an accessible and expressive interface for specifying long-term tasks for robotic agents. However, non-experts are likely to specify such tasks with high-level instructions, which abstract over specific robot actions through several layers of abstraction. We propose that persistent representations are key to bridging this gap between language and robot actions over long execution horizons. We propose a persistent spatial semantic representation method and show how it enables building an agent that performs hierarchical reasoning to effectively execute long-term tasks. We evaluate our approach on the ALFRED benchmark and achieve state-of-the-art results, despite completely avoiding the commonly used step-by-step instructions.
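A minimal sketch of the general idea of a persistent spatial semantic map, assuming a fixed world-frame grid into which per-point class probabilities are fused so that objects remain queryable after leaving the field of view; the grid size, resolution, and max-fusion rule are illustrative, not the paper's representation.

```python
# Hypothetical persistent semantic map: detections accumulate in a world-frame
# grid across time steps and persist when objects go out of view.
import numpy as np


class PersistentSemanticMap:
    def __init__(self, num_classes: int, grid: int = 64, cell_m: float = 0.25):
        self.cell_m = cell_m
        self.map = np.zeros((grid, grid, num_classes), dtype=np.float32)

    def integrate(self, world_xy: np.ndarray, class_probs: np.ndarray) -> None:
        """Fuse per-point class probabilities (N, C) at world positions (N, 2)."""
        idx = np.clip((world_xy / self.cell_m).astype(int), 0, self.map.shape[0] - 1)
        for (i, j), probs in zip(idx, class_probs):
            # Keep the most confident evidence seen so far for each class.
            self.map[i, j] = np.maximum(self.map[i, j], probs)

    def query(self, class_id: int) -> np.ndarray:
        """Return grid cells where the given class has been observed."""
        return np.argwhere(self.map[:, :, class_id] > 0.5)
```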
Few-shot Object Grounding and Mapping for Natural Language Robot Instruction Following
Blukis, Valts, Knepper, Ross A., Artzi, Yoav
We study the problem of learning a robot policy to follow natural language instructions that can be easily extended to reason about new objects. We introduce a few-shot language-conditioned object grounding method trained from augmented reality data that uses exemplars to identify objects and align them to their mentions in instructions. We present a learned map representation that encodes object locations and their instructed use, and construct it from our few-shot grounding output. We integrate this mapping approach into an instruction-following policy, thereby allowing it to reason about previously unseen objects at test-time by simply adding exemplars. We evaluate on the task of learning to map raw observations and instructions to continuous control of a physical quadcopter. Our approach significantly outperforms the prior state of the art in the presence of new objects, even when the prior approach observes all objects during training.
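The exemplar-based grounding idea above can be illustrated with a simple matching sketch: mentions are aligned to detected regions by similarity against a small, extensible bank of exemplar embeddings. This is a stand-in for the paper's model; the cosine metric and the encoder producing the embeddings are assumptions.

```python
# Illustrative few-shot grounding by exemplar matching. New objects are
# supported at test time simply by adding exemplar embeddings.
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T


class ExemplarGrounder:
    def __init__(self):
        self.exemplars = {}  # object name -> (num_exemplars, dim) embeddings

    def add_exemplars(self, name: str, embeddings: np.ndarray) -> None:
        """Adding exemplars extends the grounder to previously unseen objects."""
        self.exemplars[name] = embeddings

    def ground(self, mention: str, region_embeddings: np.ndarray) -> int:
        """Return the index of the detected region best matching the mention."""
        sims = cosine(region_embeddings, self.exemplars[mention])  # (regions, exemplars)
        return int(sims.max(axis=1).argmax())
```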
Learning to Map Natural Language Instructions to Physical Quadcopter Control using Simulated Flight
Blukis, Valts, Terme, Yannick, Niklasson, Eyvind, Knepper, Ross A., Artzi, Yoav
We propose a joint simulation and real-world learning framework for mapping navigation instructions and raw first-person observations to continuous control. Our model estimates the need for environment exploration, predicts the likelihood of visiting environment positions during execution, and controls the agent to both explore and visit high-likelihood positions. We introduce Supervised Reinforcement Asynchronous Learning (SuReAL). Learning uses both simulation and real environments without requiring autonomous flight in the physical environment during training, and combines supervised learning for predicting positions to visit with reinforcement learning for continuous control. We evaluate our approach on a natural language instruction-following task with a physical quadcopter, and demonstrate effective execution and exploration behavior.
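The two-stage split described above (supervised prediction of positions to visit, feeding a learned controller) can be sketched schematically as follows. This is not the SuReAL training procedure: the grid, the greedy goal selection, and the proportional controller standing in for the learned policy are all illustrative assumptions.

```python
# Schematic two-stage pipeline: a predicted visitation distribution over a
# grid is turned into a continuous velocity command toward the likeliest cell.
import numpy as np


def select_goal(visit_probs: np.ndarray, cell_m: float = 0.5) -> np.ndarray:
    """Pick the world-frame position of the most likely cell to visit next."""
    i, j = np.unravel_index(np.argmax(visit_probs), visit_probs.shape)
    return np.array([i, j], dtype=np.float32) * cell_m


def velocity_command(pose_xy: np.ndarray, goal_xy: np.ndarray,
                     gain: float = 0.8, v_max: float = 1.0) -> np.ndarray:
    """Proportional controller used here as a stand-in for the learned policy."""
    v = gain * (goal_xy - pose_xy)
    speed = np.linalg.norm(v)
    return v if speed <= v_max else v / speed * v_max


if __name__ == "__main__":
    probs = np.random.dirichlet(np.ones(64 * 64)).reshape(64, 64)  # fake prediction
    print(velocity_command(np.zeros(2), select_goal(probs)))
```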
Mapping Navigation Instructions to Continuous Control Actions with Position-Visitation Prediction
Blukis, Valts, Misra, Dipendra, Knepper, Ross A., Artzi, Yoav
Executing natural language navigation instructions from raw observations requires solving language, perception, planning, and control problems. Consider instructing a quadcopter drone using natural language, for example to go towards a blue fence, keeping an anvil and a tree on the right. Resolving the instruction requires identifying the blue fence, anvil, and tree in the world, understanding the spatial constraints towards and on the right, planning a trajectory that satisfies these constraints, and continuously controlling the quadcopter to follow the trajectory. Existing work has addressed this problem mostly using manually-designed symbolic representations for language meaning and the environment [1, 2, 3, 4, 5, 6].
Following High-level Navigation Instructions on a Simulated Quadcopter with Imitation Learning
Blukis, Valts, Brukhim, Nataly, Bennett, Andrew, Knepper, Ross A., Artzi, Yoav
We introduce a method for following high-level navigation instructions by mapping directly from images, instructions, and pose estimates to continuous low-level velocity commands for real-time control. The Grounded Semantic Mapping Network (GSMN) is a fully-differentiable neural network architecture that builds an explicit semantic map in the world reference frame by incorporating a pinhole camera projection model within the network. The information stored in the map is learned from experience, while the local-to-world transformation is computed explicitly. We train the model using DAggerFM, a modified variant of DAgger that trades tabular convergence guarantees for improved training speed and memory use. We test GSMN in virtual environments on a realistic quadcopter simulator and show that incorporating explicit mapping and grounding modules allows GSMN to outperform strong neural baselines and almost reach the performance of an expert policy. Finally, we analyze the learned map representations and show that using an explicit map leads to an interpretable instruction-following model.
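In the spirit of the mapping step described above, here is a minimal sketch of projecting per-pixel image features into a top-down world-frame grid with a pinhole camera model. It is not the GSMN implementation (which performs this projection differentiably inside the network); the intrinsics, depth input, map size, and max-fusion rule are assumed for illustration.

```python
# Back-project per-pixel features through a pinhole camera model and scatter
# them into a world-frame grid map.
import numpy as np


def features_to_world_map(features: np.ndarray, depth: np.ndarray,
                          K: np.ndarray, cam_to_world: np.ndarray,
                          grid: int = 64, cell_m: float = 0.25) -> np.ndarray:
    """Scatter per-pixel features (H, W, C) into a top-down world-frame grid."""
    h, w, c = features.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)  # (N, 3)
    rays = (np.linalg.inv(K) @ pix.T).T                              # camera-frame rays
    pts_cam = rays * depth.reshape(-1, 1)                            # back-project by depth
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    pts_world = (cam_to_world @ pts_h.T).T[:, :3]                    # world-frame points

    world_map = np.zeros((grid, grid, c), dtype=np.float32)
    ij = np.clip((pts_world[:, :2] / cell_m).astype(int), 0, grid - 1)
    for (i, j), f in zip(ij, features.reshape(-1, c)):
        world_map[i, j] = np.maximum(world_map[i, j], f)             # max-pool fusion
    return world_map
```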