Ramos, Fabio
Controlled Latent Diffusion Models for 3D Porous Media Reconstruction
Naiff, Danilo, Schaeffer, Bernardo P., Pires, Gustavo, Stojkovic, Dragan, Rapstine, Thomas, Ramos, Fabio
Three-dimensional digital reconstruction of porous media presents a fundamental challenge in geoscience, requiring simultaneous resolution of fine-scale pore structures while capturing representative elementary volumes. We introduce a computational framework that addresses this challenge through latent diffusion models operating within the EDM framework. Our approach reduces dimensionality via a custom variational autoencoder trained on binary geological volumes, improving efficiency and also enabling the generation of larger volumes than previously possible with diffusion models. A key innovation is our controlled unconditional sampling methodology, which enhances distribution coverage by first sampling target statistics from their empirical distributions, then generating samples conditioned on these values. Extensive testing on four distinct rock types demonstrates that conditioning on porosity, a readily computable statistic, is sufficient to ensure a consistent representation of multiple complex properties, including permeability, two-point correlation functions, and pore size distributions. The framework achieves better generation quality than pixel-space diffusion while enabling significantly larger volume reconstruction (256³ voxels) with substantially reduced computational requirements, establishing a new state of the art for digital rock physics applications.
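The two-step controlled unconditional sampling described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `controlled_unconditional_sample`, `toy_conditional_sampler`, and the porosity values are hypothetical, and the toy sampler (independent Bernoulli voxels) merely stands in for a porosity-conditioned latent diffusion model.

```python
import numpy as np

def controlled_unconditional_sample(train_porosities, conditional_sampler, n_samples, rng):
    """Sample targets from the empirical distribution, then condition on them."""
    # Step 1: draw target porosities by resampling the training statistics.
    targets = rng.choice(train_porosities, size=n_samples, replace=True)
    # Step 2: generate one volume per target with the conditional generator.
    volumes = np.stack([conditional_sampler(p, rng) for p in targets])
    return targets, volumes

def toy_conditional_sampler(porosity, rng, shape=(16, 16, 16)):
    # Hypothetical stand-in for the conditioned generator (assumption):
    # each voxel is pore (1) with probability equal to the target porosity.
    return (rng.random(shape) < porosity).astype(np.uint8)

rng = np.random.default_rng(0)
train_porosities = np.array([0.12, 0.15, 0.18, 0.21, 0.25])  # made-up values
targets, volumes = controlled_unconditional_sample(
    train_porosities, toy_conditional_sampler, n_samples=8, rng=rng
)
# Achieved porosity of each generated volume (fraction of pore voxels).
achieved = volumes.reshape(len(volumes), -1).mean(axis=1)
```

Comparing `achieved` with `targets` gives a simple check that the generator honors the conditioning value.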
Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control
NVIDIA, Alhaija, Hassan Abu, Alvarez, Jose, Bala, Maciej, Cai, Tiffany, Cao, Tianshi, Cha, Liz, Chen, Joshua, Chen, Mike, Ferroni, Francesco, Fidler, Sanja, Fox, Dieter, Ge, Yunhao, Gu, Jinwei, Hassani, Ali, Isaev, Michael, Jannaty, Pooya, Lan, Shiyi, Lasser, Tobias, Ling, Huan, Liu, Ming-Yu, Liu, Xian, Lu, Yifan, Luo, Alice, Ma, Qianli, Mao, Hanzi, Ramos, Fabio, Ren, Xuanchi, Shen, Tianchang, Tang, Shitao, Wang, Ting-Chun, Wu, Jay, Xu, Jiashu, Xu, Stella, Xie, Kevin, Ye, Yuchong, Yang, Xiaodong, Zeng, Xiaohui, Zeng, Yu
We introduce Cosmos-Transfer1, a conditional world generation model that can generate world simulations based on multiple spatial control inputs of various modalities such as segmentation, depth, and edge. In the design, the spatial conditional scheme is adaptive and customizable. It allows weighting different conditional inputs differently at different spatial locations. This enables highly controllable world generation and finds use in various world-to-world transfer use cases, including Sim2Real. We conduct extensive evaluations to analyze the proposed model and demonstrate its applications for Physical AI, including robotics Sim2Real and autonomous vehicle data enrichment. We further demonstrate an inference scaling strategy to achieve real-time world generation with an NVIDIA GB200 NVL72 rack.
HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation
Li, Yi, Deng, Yuquan, Zhang, Jesse, Jang, Joel, Memmel, Marius, Yu, Raymond, Garrett, Caelan Reed, Ramos, Fabio, Fox, Dieter, Li, Anqi, Gupta, Abhishek, Goyal, Ankit
Large foundation models have shown strong open-world generalization to complex problems in vision and language, but similar levels of generalization have yet to be achieved in robotics. One fundamental challenge is the lack of robotic data, which are typically obtained through expensive on-robot operation. A promising remedy is to leverage cheaper, off-domain data such as action-free videos, hand-drawn sketches or simulation data. In this work, we posit that hierarchical vision-language-action (VLA) models can be more effective in utilizing off-domain data than standard monolithic VLA models that directly finetune vision-language models (VLMs) to predict actions. In particular, we study a class of hierarchical VLA models, where the high-level VLM is finetuned to produce a coarse 2D path indicating the desired robot end-effector trajectory given an RGB image and a task description. The intermediate 2D path prediction then serves as guidance to the low-level, 3D-aware control policy capable of precise manipulation. Doing so relieves the high-level VLM of fine-grained action prediction, while reducing the low-level policy's burden of complex task-level reasoning. We show that, with the hierarchical design, the high-level VLM can transfer across significant domain gaps between the off-domain finetuning data and real-robot testing scenarios, including differences in embodiments, dynamics, visual appearances and task semantics. In the real-robot experiments, we observe an average of 20% improvement in success rate across seven different axes of generalization over OpenVLA, representing a 50% relative gain. Visual results are provided at: https://hamster-robot.github.io/
Aim My Robot: Precision Local Navigation to Any Object
Meng, Xiangyun, Yang, Xuning, Jung, Sanghun, Ramos, Fabio, Jujjavarapu, Srid Sadhan, Paul, Sanjoy, Fox, Dieter
Existing navigation systems mostly consider "success" when the robot reaches within 1m radius to a goal. To this end, we design and implement Aim-My-Robot (AMR), a local navigation system that enables a robot to reach any object in its vicinity at the desired relative pose, with centimeter-level precision. AMR shows strong sim2real transfer and can adapt to different robot kinematics and unseen objects with little to no fine-tuning. ... the goal reached when the robot is within 1m radius to the object goal [8], [11], [12]. This lax definition of success hinders their applicability to the growing need for mobile robots to navigate to objects with precisely ... But this usually requires specific information such as 3D models [13], and the object being initially visible. This limits its applicability when the object 3D model is not available or the object is initially out ...
Grasping by parallel shape matching
Zhang, Wenzheng, Maken, Fahira Afzal, Lai, Tin, Ramos, Fabio
Grasping is essential in robotic manipulation, yet challenging due to object and gripper diversity and real-world complexities. Traditional analytic approaches often have long optimization times, while data-driven methods struggle with unseen objects. This paper formulates the problem as rigid shape matching between gripper and object, which is optimized with Annealed Stein Iterative Closest Point (AS-ICP) and leverages GPU-based parallelization. By incorporating the gripper's tool center point and the object's center of mass into the cost function and using a signed distance field of the gripper for collision checking, our method achieves robust grasps with low computational time. Experiments with the Kinova KG3 gripper show an 87.3% success rate and 0.926 s computation time across various objects and settings, highlighting its potential for real-world applications.
Differentiable GPU-Parallelized Task and Motion Planning
Shen, William, Garrett, Caelan, Goyal, Ankit, Hermans, Tucker, Ramos, Fabio
We present a differentiable optimization-based framework for Task and Motion Planning (TAMP) that is massively parallelizable on GPUs, enabling thousands of sampled seeds to be optimized simultaneously. Existing sampling-based approaches inherently disconnect the parameters by generating samples for each independently and combining them through composition and rejection, while optimization-based methods struggle with highly non-convex constraints and local optima. Our method treats TAMP constraint satisfaction as optimizing a batch of particles, each representing an assignment to a plan skeleton's continuous parameters. We represent the plan skeleton's constraints using differentiable cost functions, enabling us to compute the gradient of each particle and update it toward satisfying solutions. Our use of GPU parallelism better covers the parameter space through scale, increasing the likelihood of finding a global optimum by exploring multiple basins through global sampling. We demonstrate that our algorithm can effectively solve a highly constrained Tetris packing problem using a Franka arm in simulation and deploy our planner on a real robot arm. Website: https://williamshen-nz.github.io/gpu-tamp
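The idea of optimizing a batch of particles against differentiable constraint costs can be illustrated with a toy problem. The sketch below is an assumption-laden stand-in, not the paper's planner: the two constraints (lie on the unit circle, satisfy x >= 0.5), the function names, and the step size are all hypothetical, and NumPy vectorization stands in for GPU parallelism.

```python
import numpy as np

def cost_and_grad(particles):
    """Constraint-violation cost and its analytic gradient for each particle.

    Toy constraints (assumed for illustration): the 2D point must lie on
    the unit circle and satisfy x >= 0.5.
    """
    r = np.linalg.norm(particles, axis=1)
    r_safe = np.maximum(r, 1e-9)  # avoid division by zero at the origin
    circle_violation = r - 1.0
    half_plane_violation = np.maximum(0.5 - particles[:, 0], 0.0)
    cost = circle_violation**2 + half_plane_violation**2
    grad = 2.0 * circle_violation[:, None] * particles / r_safe[:, None]
    grad[:, 0] -= 2.0 * half_plane_violation
    return cost, grad

def optimize_particles(n_particles=1024, n_steps=300, step_size=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # Global sampling: seed many particles across the parameter space.
    particles = rng.uniform(-2.0, 2.0, size=(n_particles, 2))
    for _ in range(n_steps):
        _, grad = cost_and_grad(particles)
        # All particles take a gradient step simultaneously.
        particles = particles - step_size * grad
    cost, _ = cost_and_grad(particles)
    return particles, cost

particles, cost = optimize_particles()
solved = cost < 1e-6  # particles that satisfy both constraints
n_solved = int(solved.sum())
```

Any particle flagged in `solved` is a satisfying assignment; in a real planner, the batch would be much larger and the costs would encode kinematic and collision constraints.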
Open-World Task and Motion Planning via Vision-Language Model Inferred Constraints
Kumar, Nishanth, Ramos, Fabio, Fox, Dieter, Garrett, Caelan Reed
Foundation models trained on internet-scale data, such as Vision-Language Models (VLMs), excel at performing tasks involving common sense, such as visual question answering. Despite their impressive capabilities, these models cannot currently be directly applied to challenging robot manipulation problems that require complex and precise continuous reasoning. Task and Motion Planning (TAMP) systems can control high-dimensional continuous systems over long horizons by combining traditional primitive robot operations. However, these systems require a detailed model of how the robot can impact its environment, preventing them from directly interpreting and addressing novel human objectives, for example, an arbitrary natural language goal. We propose deploying VLMs within TAMP systems by having them generate discrete and continuous language-parameterized constraints that enable TAMP to reason about open-world concepts. Specifically, we propose algorithms for VLM partial planning that constrain a TAMP system's discrete temporal search and VLM continuous constraints interpretation to augment the traditional manipulation constraints that TAMP systems seek to satisfy. We demonstrate our approach on two robot embodiments, including a real-world robot, across several manipulation tasks, where the desired objectives are conveyed solely through language.
Ergodic Trajectory Optimization on Generalized Domains Using Maximum Mean Discrepancy
Hughes, Christian, Warren, Houston, Lee, Darrick, Ramos, Fabio, Abraham, Ian
We present a novel formulation of ergodic trajectory optimization that can be specified over general domains using kernel maximum mean discrepancy. Ergodic trajectory optimization is an effective approach that generates coverage paths for problems related to robotic inspection, information gathering, and search and rescue. These optimization schemes compel the robot to spend time in a region proportional to the expected utility of visiting that region. Current methods for ergodic trajectory optimization rely on domain-specific knowledge, e.g., a defined utility map, and well-defined spatial basis functions to produce ergodic trajectories. Here, we present a generalization of ergodic trajectory optimization based on maximum mean discrepancy that requires only samples from the search domain. We demonstrate the ability of our approach to produce coverage trajectories on a variety of problem domains, including robotic inspection of objects with differential kinematics constraints and on Lie groups, without having access to domain-specific knowledge. Furthermore, we show favorable computational scaling compared to existing state-of-the-art methods for ergodic trajectory optimization, with a trade-off between domain-specific knowledge and computational scaling, thus extending the versatility of ergodic coverage to a wider application domain.
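A kernel maximum mean discrepancy between trajectory points and samples from the search domain can be computed as in the sketch below. This only illustrates the discrepancy measure, under assumptions: the RBF kernel, the bandwidth, the biased estimator, and the two example trajectories are illustrative choices, and no trajectory optimizer is included.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    """Gaussian (RBF) kernel matrix between point sets a (n, d) and b (m, d)."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth**2))

def mmd_squared(x, y, bandwidth=0.5):
    """Biased (V-statistic) estimate of squared MMD between samples x and y."""
    return (
        rbf_kernel(x, x, bandwidth).mean()
        + rbf_kernel(y, y, bandwidth).mean()
        - 2.0 * rbf_kernel(x, y, bandwidth).mean()
    )

rng = np.random.default_rng(0)
# Samples from the search domain: here, uniform over the unit square.
target = rng.uniform(0.0, 1.0, size=(400, 2))

t = np.linspace(0.0, 1.0, 200)
# A trajectory that sweeps the square versus one that stays in a corner.
covering = np.stack([t, 0.5 + 0.5 * np.sin(2.0 * np.pi * 5.0 * t)], axis=1)
clustered = 0.1 * np.stack([t, t], axis=1)

mmd_covering = mmd_squared(covering, target)
mmd_clustered = mmd_squared(clustered, target)
```

The sweeping trajectory yields the smaller discrepancy, which is the quantity an MMD-based ergodic optimizer would drive down; only samples from the domain are needed, not a utility map or basis functions.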
Similarity Learning with neural networks
Sanfins, Gabriel, Ramos, Fabio, Naiff, Danilo
Understanding and predicting the behavior of complex physical systems is a cornerstone of scientific and engineering endeavors. In fluid mechanics, for instance, accurately simulating real operational conditions is essential for the design and optimization of pipelines, aerospace components, and various industrial processes. However, full-scale simulations of such systems are often prohibitively expensive and time-consuming due to the intricate dynamics and vast parameter spaces involved. This poses a significant challenge for researchers and engineers who seek to explore and optimize these systems efficiently. One promising approach to mitigate these challenges is the identification of scaling similarities and symmetry groups within physical systems. By uncovering the correct scaling relations, we can develop smaller, more manageable models that accurately capture the essential behavior of real-world scenarios. These scaled models not only reduce computational costs but also accelerate the design and testing processes by allowing for efficient exploration of the parameter space. Moreover, understanding these scaling laws deepens our insight into the fundamental principles governing these systems, enabling us to generalize findings from simplified models to full-scale applications with greater confidence. In recent years, the application of machine learning in fluid mechanics has been on the rise, offering innovative tools to address complex problems that are difficult to solve analytically.
AutoMate: Specialist and Generalist Assembly Policies over Diverse Geometries
Tang, Bingjie, Akinola, Iretiayo, Xu, Jie, Wen, Bowen, Handa, Ankur, Van Wyk, Karl, Fox, Dieter, Sukhatme, Gaurav S., Ramos, Fabio, Narang, Yashraj
Robotic assembly for high-mixture settings requires adaptivity to diverse parts and poses, which is an open challenge. Meanwhile, in other areas of robotics, large models and sim-to-real have led to tremendous progress. Inspired by such work, we present AutoMate, a learning framework and system that consists of 4 parts: 1) a dataset of 100 assemblies compatible with simulation and the real world, along with parallelized simulation environments for policy learning, 2) a novel simulation-based approach for learning specialist (i.e., part-specific) policies and generalist (i.e., unified) assembly policies, 3) demonstrations of specialist policies that individually solve 80 assemblies with 80% or higher success rates in simulation, as well as a generalist policy that jointly solves 20 assemblies with an 80%+ success rate, and 4) zero-shot sim-to-real transfer that achieves performance similar to (or better than) that in simulation, including on perception-initialized assembly. The key methodological takeaway is that a union of diverse algorithms from manufacturing engineering, character animation, and time-series analysis provides a generic and robust solution for a diverse range of robotic assembly problems. To our knowledge, AutoMate provides the first simulation-based framework for learning specialist and generalist policies over a wide range of assemblies, as well as the first system demonstrating zero-shot sim-to-real transfer over such a range.