Gilitschenski, Igor
GeoMatch++: Morphology Conditioned Geometry Matching for Multi-Embodiment Grasping
Wei, Yunze, Attarian, Maria, Gilitschenski, Igor
As we aspire to solve more dexterous tasks in robotics, multi-finger grasping becomes increasingly important. However, the varying degrees of freedom (DoF) of end-effectors and the high multimodality of grasping modes, which depend on both the end-effector and the object, still pose open challenges. Previous works in grasping focus on parallel grippers [1, 2, 3], a single multi-finger gripper [4, 5, 6, 7], or a shared policy for multiple dexterous grippers [8, 9, 10, 11]. However, even methods that explore cross-embodiment mostly focus on generalization to unseen objects and still show limited zero-shot generalization to unseen grippers. In this work, we propose GeoMatch++, a multi-embodiment grasping method that improves out-of-domain generalization to unseen grippers by leveraging robot morphology. Intuitively, robot morphology is essential to grasping: end-effectors may differ in their number of fingers, but fingertips and the palm tend to be the most frequent contact regions. Thus, we hypothesize that learning good morphology embeddings can lead to a grasping policy that transfers between different robots. Our main contribution is learning geometry correlation features between objects and end-effector morphology, which improve out-of-domain grasp success by 9.64% over previous methods while showing only a minimal decrease in performance compared to in-domain evaluation.
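A minimal, purely illustrative sketch of the kind of geometry-morphology correlation module the abstract describes, assuming per-point object features and per-keypoint gripper features already exist; all names, dimensions, and the contact head are assumptions, not the paper's implementation.

```python
# Illustrative only: morphology keypoints (fingertips, palm, ...) cross-attend
# to object point features to produce per-keypoint contact predictions.
import torch
import torch.nn as nn


class GeometryMorphologyMatcher(nn.Module):
    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.contact_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 3)
        )  # predicts a contact point (x, y, z) per morphology keypoint (assumed head)

    def forward(self, obj_feats: torch.Tensor, morph_feats: torch.Tensor):
        # obj_feats:   (B, N_obj, dim)  per-point object geometry features
        # morph_feats: (B, N_key, dim)  per-keypoint gripper morphology features
        fused, attn = self.cross_attn(morph_feats, obj_feats, obj_feats)
        return self.contact_head(fused), attn  # (B, N_key, 3), attention map


# Toy usage with random features standing in for learned encoders.
obj = torch.randn(2, 1024, 128)
morph = torch.randn(2, 6, 128)
contacts, attn = GeometryMorphologyMatcher()(obj, morph)
print(contacts.shape, attn.shape)  # torch.Size([2, 6, 3]) torch.Size([2, 6, 1024])
```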
SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation
Namekata, Koichi, Bahmani, Sherwin, Wu, Ziyi, Kant, Yash, Gilitschenski, Igor, Lindell, David B.
Methods for image-to-video generation have achieved impressive, photo-realistic quality. However, adjusting specific elements in generated videos, such as object motion or camera movement, is often a tedious process of trial and error, e.g., involving re-generating videos with different random seeds. Recent techniques address this issue by fine-tuning a pre-trained model to follow conditioning signals, such as bounding boxes or point trajectories. Yet, this fine-tuning procedure can be computationally expensive, and it requires datasets with annotated object motion, which can be difficult to procure. In this work, we introduce SG-I2V, a framework for controllable image-to-video generation that is self-guided, offering zero-shot control by relying solely on the knowledge present in a pre-trained image-to-video diffusion model, without the need for fine-tuning or external knowledge. Our zero-shot method outperforms unsupervised baselines while significantly narrowing the performance gap to supervised models in terms of visual quality and motion fidelity.
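A heavily hedged sketch of the general idea of self-guided sampling: at selected denoising steps, the latent is nudged so that internal feature maps follow a user-given trajectory. The denoiser/feature extractor below is a stand-in stub, and the loss and update rule are assumptions; the actual method operates on the internal features of a pre-trained image-to-video diffusion model.

```python
# Illustrative latent-guidance step; everything here is a simplification.
import torch

def fake_denoiser_features(latent):
    # Stand-in for extracting spatial feature maps from the diffusion backbone.
    return latent.mean(dim=1, keepdim=True)  # (B, 1, T, H, W)

def trajectory_alignment_loss(feats, box_src, boxes_tgt):
    # Encourage features inside the source box (frame 0) to match features
    # inside the target box at each later frame; boxes are (x0, y0, x1, y1).
    src = feats[:, :, 0, box_src[1]:box_src[3], box_src[0]:box_src[2]]
    loss = 0.0
    for t, box in enumerate(boxes_tgt, start=1):
        tgt = feats[:, :, t, box[1]:box[3], box[0]:box[2]]
        loss = loss + (src - tgt).pow(2).mean()
    return loss

def guided_step(latent, box_src, boxes_tgt, lr=0.05):
    latent = latent.detach().requires_grad_(True)
    feats = fake_denoiser_features(latent)
    loss = trajectory_alignment_loss(feats, box_src, boxes_tgt)
    (grad,) = torch.autograd.grad(loss, latent)
    return (latent - lr * grad).detach()  # latent is updated; the model is not

latent = torch.randn(1, 4, 4, 32, 32)  # (B, C, frames, H, W), toy sizes
boxes = [(8, 8, 16, 16), (10, 8, 18, 16), (12, 8, 20, 16)]
latent = guided_step(latent, boxes[0], boxes[1:])
```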
Feed-Forward Bullet-Time Reconstruction of Dynamic Scenes from Monocular Videos
Liang, Hanxue, Ren, Jiawei, Mirzaei, Ashkan, Torralba, Antonio, Liu, Ziwei, Gilitschenski, Igor, Fidler, Sanja, Oztireli, Cengiz, Ling, Huan, Gojcic, Zan, Huang, Jiahui
Recent advancements in static feed-forward scene reconstruction have demonstrated significant progress in high-quality novel view synthesis. However, these models often struggle with generalizability across diverse environments and fail to effectively handle dynamic content. We present BTimer (short for BulletTimer), the first motion-aware feed-forward model for real-time reconstruction and novel view synthesis of dynamic scenes. Our approach reconstructs the full scene in a 3D Gaussian Splatting representation at a given target ('bullet') timestamp by aggregating information from all the context frames. This formulation allows BTimer to gain scalability and generalization by leveraging both static and dynamic scene datasets. Given a casually captured monocular video of a dynamic scene, BTimer reconstructs a bullet-time scene within 150 ms while reaching state-of-the-art performance on both static and dynamic scene datasets, even compared with optimization-based approaches.
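A rough interface sketch of the formulation the abstract describes, assuming a per-pixel Gaussian prediction head; the encoder, aggregation, and parameterization below are placeholders, not the released model.

```python
# Illustrative only: context frames + a target "bullet" timestamp are mapped
# feed-forward to per-pixel 3D Gaussian parameters for that instant.
import torch
import torch.nn as nn


class BulletTimeReconstructor(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.encoder = nn.Conv2d(3 + 1, dim, kernel_size=3, padding=1)
        # Per-pixel Gaussian params: xyz(3) + scale(3) + rotation(4) + opacity(1) + rgb(3)
        self.head = nn.Conv2d(dim, 14, kernel_size=1)

    def forward(self, frames, frame_times, bullet_time):
        # frames: (B, T, 3, H, W); the time offset to the bullet timestamp is
        # appended as an extra input channel per context frame.
        B, T, _, H, W = frames.shape
        dt = (frame_times - bullet_time).view(B, T, 1, 1, 1).expand(B, T, 1, H, W)
        x = torch.cat([frames, dt], dim=2).flatten(0, 1)      # (B*T, 4, H, W)
        feats = self.encoder(x).view(B, T, -1, H, W).mean(1)  # aggregate context frames
        return self.head(feats)                               # (B, 14, H, W)


model = BulletTimeReconstructor()
frames = torch.randn(1, 4, 3, 64, 64)
times = torch.linspace(0, 1, 4).unsqueeze(0)  # (1, 4)
gaussians = model(frames, times, bullet_time=torch.tensor([[0.5]]))
print(gaussians.shape)  # torch.Size([1, 14, 64, 64])
```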
MAD-TD: Model-Augmented Data Stabilizes High Update Ratio RL
Voelcker, Claas A, Hussing, Marcel, Eaton, Eric, Farahmand, Amir-massoud, Gilitschenski, Igor
Building deep reinforcement learning (RL) agents that find a good policy with few samples has proven notoriously challenging. To achieve sample efficiency, recent work has explored updating neural networks with large numbers of gradient steps for every new sample. While such high update-to-data (UTD) ratios have shown strong empirical performance, they also introduce instability to the training process. Previous approaches rely on periodic neural network parameter resets to address this instability, but restarting the training process is infeasible in many real-world applications and requires tuning the resetting interval. In this paper, we focus on one of the core difficulties of stable training with limited samples: the inability of learned value functions to generalize to unobserved on-policy actions. We mitigate this issue directly by augmenting the off-policy RL training process with a small amount of data generated from a learned world model. Our method, Model-Augmented Data for Temporal Difference learning (MAD-TD), uses small amounts of generated data to stabilize high UTD training and achieve competitive performance on the most challenging tasks in the DeepMind Control Suite. Our experiments further highlight the importance of employing a good model to generate data, MAD-TD's ability to combat value overestimation, and its practical stability gains for continued learning.
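A minimal sketch of the data-mixing idea the abstract describes; the ratio, batch layout, and sampler names are illustrative assumptions, not the authors' code.

```python
# Illustrative only: each gradient step trains on a batch that is mostly real
# replay data plus a small fraction of transitions rolled out from a learned
# world model, so high-UTD training stays stable without parameter resets.
import torch

def mixed_td_batch(replay_sample, model_rollout, synthetic_fraction=0.05, batch_size=256):
    n_synth = int(batch_size * synthetic_fraction)   # e.g. ~5% generated data
    real = replay_sample(batch_size - n_synth)       # dict of tensors
    synth = model_rollout(n_synth)                   # same keys and shapes
    return {k: torch.cat([real[k], synth[k]], dim=0) for k in real}

# Stand-in samplers so the sketch runs; a real agent would draw from its
# replay buffer and a learned dynamics model, respectively.
def replay_sample(n):
    return {"obs": torch.randn(n, 17), "action": torch.randn(n, 6),
            "reward": torch.randn(n, 1), "next_obs": torch.randn(n, 17)}

def model_rollout(n):
    return {k: v.clone() for k, v in replay_sample(n).items()}

batch = mixed_td_batch(replay_sample, model_rollout)
print({k: v.shape for k, v in batch.items()})
# Every TD / actor update (many per environment step) would consume such a batch.
```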
Realistic Evaluation of Model Merging for Compositional Generalization
Tam, Derek, Kant, Yash, Lester, Brian, Gilitschenski, Igor, Raffel, Colin
Merging has become a widespread way to cheaply combine individual models into a single model that inherits their capabilities and attains better performance. This popularity has spurred rapid development of many new merging methods, which are typically validated in disparate experimental settings and frequently differ in the assumptions made about model architecture, data availability, and computational budget. In this work, we characterize the relative merits of different merging methods by evaluating them in a shared experimental setting and precisely identifying the practical requirements of each method. Specifically, our setting focuses on using merging for compositional generalization of capabilities in image classification, image generation, and natural language processing. Additionally, we measure the computational costs of different merging methods as well as how they perform when scaling the number of models being merged. Taken together, our results clarify the state of the field of model merging and provide a comprehensive and rigorous experimental setup to test new methods.
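To make the setting concrete, here is a hedged sketch of the simplest merging baseline, uniform parameter averaging of models that share an architecture; it stands in for the many more sophisticated methods the paper benchmarks under shared assumptions about data and compute.

```python
# Illustrative only: (weighted) parameter averaging across fine-tuned checkpoints.
import torch

def average_merge(state_dicts, weights=None):
    """Merge same-architecture models by averaging their parameters."""
    weights = weights or [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# Toy usage: merge two tiny "fine-tuned" models.
m1, m2 = torch.nn.Linear(4, 2), torch.nn.Linear(4, 2)
merged_sd = average_merge([m1.state_dict(), m2.state_dict()])
m_merged = torch.nn.Linear(4, 2)
m_merged.load_state_dict(merged_sd)
```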
Accelerating Online Mapping and Behavior Prediction via Direct BEV Feature Attention
Gu, Xunjiang, Song, Guanyu, Gilitschenski, Igor, Pavone, Marco, Ivanovic, Boris
Understanding road geometry is a critical component of the autonomous vehicle (AV) stack. While high-definition (HD) maps can readily provide such information, they suffer from high labeling and maintenance costs. Accordingly, many recent works have proposed methods for estimating HD maps online from sensor data. The vast majority of recent approaches encode multi-camera observations into an intermediate representation, e.g., a bird's eye view (BEV) grid, and produce vector map elements via a decoder. While this architecture is performant, it discards much of the information encoded in the intermediate representation, preventing downstream tasks (e.g., behavior prediction) from leveraging it. In this work, we propose exposing the rich internal features of online map estimation methods and show how they enable more tightly integrating online mapping with trajectory forecasting. In doing so, we find that directly accessing internal BEV features yields up to 73% faster inference speeds and up to 29% more accurate predictions on the real-world nuScenes dataset.
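An illustrative sketch of what consuming BEV features directly could look like, assuming per-agent query tokens; shapes, heads, and the trajectory parameterization are assumptions, not the paper's architecture.

```python
# Illustrative only: instead of decoding the BEV grid into vector map elements
# and discarding it, the prediction head cross-attends to the flattened BEV features.
import torch
import torch.nn as nn


class BEVConditionedPredictor(nn.Module):
    def __init__(self, dim=128, num_heads=8, horizon=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.traj_head = nn.Linear(dim, horizon * 2)  # (x, y) waypoints

    def forward(self, agent_queries, bev_features):
        # agent_queries: (B, A, dim) one query per agent
        # bev_features:  (B, dim, H, W) intermediate map-estimation features
        B, C, H, W = bev_features.shape
        bev_tokens = bev_features.flatten(2).transpose(1, 2)   # (B, H*W, dim)
        fused, _ = self.attn(agent_queries, bev_tokens, bev_tokens)
        return self.traj_head(fused).unflatten(-1, (-1, 2))    # (B, A, horizon, 2)


pred = BEVConditionedPredictor()(torch.randn(2, 5, 128), torch.randn(2, 128, 50, 50))
print(pred.shape)  # torch.Size([2, 5, 12, 2])
```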
When does Self-Prediction help? Understanding Auxiliary Tasks in Reinforcement Learning
Voelcker, Claas, Kastner, Tyler, Gilitschenski, Igor, Farahmand, Amir-massoud
We investigate the impact of auxiliary learning tasks, such as observation reconstruction and latent self-prediction, on the representation learning problem in reinforcement learning, and study how they interact with distractions and observation functions in the MDP. We provide a theoretical analysis of the learning dynamics of observation reconstruction, latent self-prediction, and TD learning in the presence of distractions and observation functions under linear model assumptions. With this formalization, we are able to explain why latent self-prediction is a helpful auxiliary task, while observation reconstruction can provide more useful features when used in isolation. Our empirical analysis shows that the insights obtained from our learning dynamics framework predict the behavior of these loss functions beyond the linear model assumption in non-linear neural networks. This reinforces the usefulness of the linear model framework not only for theoretical analysis but also for its practical benefit in applied problems.
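For concreteness, a hedged sketch of the two auxiliary objectives being compared, in their simplest form; the architectures and dimensions are illustrative, not the paper's setup.

```python
# Observation reconstruction decodes the latent back to the observation (so the
# latent must retain everything, including distractors); latent self-prediction
# only asks the latent to be predictable under a learned dynamics model.
import torch
import torch.nn as nn

dim_obs, dim_z, dim_a = 32, 8, 4
encoder = nn.Linear(dim_obs, dim_z)
decoder = nn.Linear(dim_z, dim_obs)          # used by reconstruction
dynamics = nn.Linear(dim_z + dim_a, dim_z)   # used by self-prediction

obs, next_obs, act = torch.randn(16, dim_obs), torch.randn(16, dim_obs), torch.randn(16, dim_a)
z, z_next = encoder(obs), encoder(next_obs)

# Observation reconstruction loss.
recon_loss = (decoder(z) - obs).pow(2).mean()

# Latent self-prediction loss; the next-state encoding is a (detached) target.
pred_next = dynamics(torch.cat([z, act], dim=-1))
self_pred_loss = (pred_next - z_next.detach()).pow(2).mean()
```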
Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models
Wu, Ziyi, Rubanova, Yulia, Kabra, Rishabh, Hudson, Drew A., Gilitschenski, Igor, Aytar, Yusuf, van Steenkiste, Sjoerd, Allen, Kelsey R., Kipf, Thomas
We address the problem of multi-object 3D pose control in image diffusion models. Instead of conditioning on a sequence of text tokens, we propose to use a set of per-object representations, Neural Assets, to control the 3D pose of individual objects in a scene. Neural Assets are obtained by pooling visual representations of objects from a reference image, such as a frame in a video, and are trained to reconstruct the respective objects in a different image, e.g., a later frame in the video. Importantly, we encode object visuals from the reference image while conditioning on object poses from the target frame. This enables learning disentangled appearance and pose features. Combining visual and 3D pose representations in a sequence-of-tokens format allows us to keep the text-to-image architecture of existing models, with Neural Assets in place of text tokens. By fine-tuning a pre-trained text-to-image diffusion model with this information, our approach enables fine-grained 3D pose and placement control of individual objects in a scene. We further demonstrate that Neural Assets can be transferred and recomposed across different scenes. Our model achieves state-of-the-art multi-object editing results on synthetic 3D scene datasets as well as on two real-world video datasets (Objectron, Waymo Open).
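A hedged sketch of the conditioning interface the abstract describes; the pooling, pose encoding, and token layout below are assumptions rather than the paper's exact design.

```python
# Illustrative only: per-object visual features from a reference frame are
# pooled into one token each, fused with the object's target-frame pose, and
# the resulting sequence takes the place of text tokens in the conditioning.
import torch
import torch.nn as nn


class NeuralAssetEncoder(nn.Module):
    def __init__(self, feat_dim=256, pose_dim=12, token_dim=768):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, token_dim)
        self.pose_proj = nn.Linear(pose_dim, token_dim)

    def forward(self, image_feats, obj_masks, target_poses):
        # image_feats:  (B, H*W, feat_dim) reference-frame features
        # obj_masks:    (B, K, H*W) soft masks selecting each object's pixels
        # target_poses: (B, K, pose_dim) e.g. a flattened 3x4 object pose
        weights = obj_masks / obj_masks.sum(-1, keepdim=True).clamp(min=1e-6)
        pooled = torch.einsum("bkn,bnd->bkd", weights, image_feats)  # mask-pool per object
        return self.feat_proj(pooled) + self.pose_proj(target_poses)  # (B, K, token_dim)


enc = NeuralAssetEncoder()
tokens = enc(torch.randn(1, 64 * 64, 256), torch.rand(1, 3, 64 * 64), torch.randn(1, 3, 12))
print(tokens.shape)  # torch.Size([1, 3, 768])
```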
Producing and Leveraging Online Map Uncertainty in Trajectory Prediction
Gu, Xunjiang, Song, Guanyu, Gilitschenski, Igor, Pavone, Marco, Ivanovic, Boris
High-definition (HD) maps have played an integral role in the development of modern autonomous vehicle (AV) stacks, albeit with high associated labeling and maintenance costs. As a result, many recent works have proposed methods for estimating HD maps online from sensor data, enabling AVs to operate outside of previously mapped regions. However, current online map estimation approaches are developed in isolation from their downstream tasks, complicating their integration in AV stacks. In particular, they do not produce uncertainty or confidence estimates. In this work, we extend multiple state-of-the-art online map estimation methods to additionally estimate uncertainty and show how this enables more tightly integrating online mapping with trajectory forecasting. In doing so, we find that incorporating uncertainty yields up to 50% faster training convergence and up to 15% better prediction performance on the real-world nuScenes driving dataset.
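A hedged sketch of one common way to attach uncertainty to vector map outputs; the choice of a Laplace distribution and the head design are assumptions for illustration, not the specific formulation used across the extended methods.

```python
# Illustrative only: regress a scale alongside each predicted map vertex and
# train with a Laplace negative log-likelihood, so downstream forecasters can
# consume (mean, scale) pairs instead of point estimates.
import torch
import torch.nn as nn


class UncertainVertexHead(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.mean = nn.Linear(dim, 2)        # (x, y) vertex position
        self.log_scale = nn.Linear(dim, 2)   # per-coordinate uncertainty

    def forward(self, queries):              # queries: (B, N_vertices, dim)
        return self.mean(queries), self.log_scale(queries).exp()


def laplace_nll(mean, scale, target):
    return (torch.log(2 * scale) + (target - mean).abs() / scale).mean()


head = UncertainVertexHead()
queries, gt = torch.randn(2, 20, 256), torch.randn(2, 20, 2)
mean, scale = head(queries)
loss = laplace_nll(mean, scale, gt)
```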
Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers
Jain, Vidhi, Attarian, Maria, Joshi, Nikhil J, Wahid, Ayzaan, Driess, Danny, Vuong, Quan, Sanketi, Pannag R, Sermanet, Pierre, Welker, Stefan, Chan, Christine, Gilitschenski, Igor, Bisk, Yonatan, Dwibedi, Debidatta
While large-scale robotic systems typically rely on textual instructions for tasks, this work explores a different approach: can robots infer the task directly from observing humans? This shift requires the robot to decode human intent and translate it into executable actions within its physical constraints and environment. We introduce Vid2Robot, a novel end-to-end video-based learning framework for robots. Given a video demonstration of a manipulation task and current visual observations, Vid2Robot directly produces robot actions. This is achieved through a unified representation model trained on a large dataset of human videos and robot trajectories. The model leverages cross-attention mechanisms to fuse prompt video features with the robot's current state and generate appropriate actions that mimic the observed task. To further improve policy performance, we propose auxiliary contrastive losses that enhance the alignment between human and robot video representations. We evaluate Vid2Robot on real-world robots, demonstrating a 20% improvement in performance compared to other video-conditioned policies when using human demonstration videos. Additionally, our model exhibits emergent capabilities, such as successfully transferring observed motions from one object to another, and long-horizon composition, thus showcasing its potential for real-world applications. Project website: vid2robot.github.io
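An illustrative sketch of the cross-attention fusion the abstract describes; token dimensions, heads, and the action head are assumptions, not the trained system.

```python
# Illustrative only: tokens of the robot's current observation cross-attend to
# tokens of the human prompt video, and the fused representation is decoded
# into a robot action.
import torch
import torch.nn as nn


class VideoConditionedPolicy(nn.Module):
    def __init__(self, dim=256, num_heads=8, action_dim=7):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.action_head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, action_dim))

    def forward(self, robot_tokens, prompt_tokens):
        # robot_tokens:  (B, N_r, dim) current-observation tokens
        # prompt_tokens: (B, N_p, dim) prompt-video tokens
        fused, _ = self.cross_attn(robot_tokens, prompt_tokens, prompt_tokens)
        return self.action_head(fused.mean(dim=1))  # (B, action_dim)


policy = VideoConditionedPolicy()
action = policy(torch.randn(1, 16, 256), torch.randn(1, 64, 256))
print(action.shape)  # torch.Size([1, 7])
```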