Zhang, Kaifeng
PhysTwin: Physics-Informed Reconstruction and Simulation of Deformable Objects from Videos
Jiang, Hanxiao, Hsu, Hao-Yu, Zhang, Kaifeng, Yu, Hsin-Ni, Wang, Shenlong, Li, Yunzhu
Creating a physical digital twin of a real-world object has immense potential in robotics, content creation, and XR. In this paper, we present PhysTwin, a novel framework that uses sparse videos of dynamic objects under interaction to produce a photo- and physically realistic, real-time interactive virtual replica. Our approach centers on two key components: (1) a physics-informed representation that combines spring-mass models for realistic physical simulation, generative shape models for geometry, and Gaussian splats for rendering; and (2) a novel multi-stage, optimization-based inverse modeling framework that reconstructs complete geometry, infers dense physical properties, and replicates realistic appearance from videos. Our method integrates an inverse physics framework with visual perception cues, enabling high-fidelity reconstruction even from partial, occluded, and limited viewpoints. PhysTwin supports modeling various deformable objects, including ropes, stuffed animals, cloth, and delivery packages. Experiments show that PhysTwin outperforms competing methods in reconstruction, rendering, future prediction, and simulation under novel interactions. We further demonstrate its applications in interactive real-time simulation and model-based robotic motion planning.
Dynamic 3D Gaussian Tracking for Graph-Based Neural Dynamics Modeling
Zhang, Mingtong, Zhang, Kaifeng, Li, Yunzhu
However, existing video prediction approaches typically do not explicitly account for the 3D information from videos, such as robot actions and objects' 3D states, limiting their use in real-world robotic applications. In this work, we introduce a framework to learn object dynamics directly from multi-view RGB videos by explicitly considering the robot's action trajectories and their effects on scene dynamics. We utilize the 3D Gaussian representation of 3D Gaussian Splatting (3DGS) to train a particle-based dynamics model using Graph Neural Networks. This model operates on sparse control particles downsampled from the densely tracked 3D Gaussian reconstructions. By learning the neural dynamics model on offline robot interaction data, our method can predict object motions under varying initial configurations and unseen robot actions. The 3D transformations of Gaussians can be interpolated from the motions of control particles, enabling the rendering of predicted future object states and achieving action-conditioned video prediction. The dynamics model can also be applied to model-based planning frameworks for object manipulation tasks. We conduct experiments on various kinds of deformable materials, including ropes, clothes, and stuffed animals, demonstrating our framework's ability to model complex shapes and dynamics. Our project page is available at https://gs-dynamics.github.io.
AdaptiGraph: Material-Adaptive Graph-Based Neural Dynamics for Robotic Manipulation
Zhang, Kaifeng, Li, Baoyu, Hauser, Kris, Li, Yunzhu
Predictive models are a crucial component of many robotic systems. Yet, constructing accurate predictive models for a variety of deformable objects, especially those with unknown physical properties, remains a significant challenge. This paper introduces AdaptiGraph, a learning-based dynamics modeling approach that enables robots to predict, adapt to, and control a wide array of challenging deformable materials with unknown physical properties. AdaptiGraph leverages the highly flexible graph-based neural dynamics (GBND) framework, which represents material bits as particles and employs a graph neural network (GNN) to predict particle motion. Its key innovation is a unified physical property-conditioned GBND model capable of predicting the motions of diverse materials with varying physical properties without retraining. Upon encountering new materials during online deployment, AdaptiGraph utilizes a physical property optimization process for a few-shot adaptation of the model, enhancing its fit to the observed interaction data. The adapted models can precisely simulate the dynamics and predict the motion of various deformable materials, such as ropes, granular media, rigid boxes, and cloth, while adapting to different physical properties, including stiffness, granular size, and center of pressure. On prediction and manipulation tasks involving a diverse set of real-world deformable objects, our method exhibits superior prediction accuracy and task proficiency over non-material-conditioned and non-adaptive models. The project page is available at https://robopil.github.io/adaptigraph/ .
Learning Manipulation Skills through Robot Chain-of-Thought with Sparse Failure Guidance
Zhang, Kaifeng, Yin, Zhao-Heng, Ye, Weirui, Gao, Yang
Defining reward functions for skill learning has been a long-standing challenge in robotics. Recently, vision-language models (VLMs) have shown promise in defining reward signals for teaching robots manipulation skills. However, existing works often provide reward guidance that is too coarse, leading to inefficient learning processes. In this paper, we address this issue by implementing more fine-grained reward guidance. We decompose tasks into simpler sub-tasks, using this decomposition to offer more informative reward guidance with VLMs. We also propose a VLM-based self imitation learning process to speed up learning. Empirical evidence demonstrates that our algorithm consistently outperforms baselines such as CLIP, LIV, and RoboCLIP. Specifically, our algorithm achieves a $5.4 \times$ higher average success rate compared to the best baseline, RoboCLIP, across a series of manipulation tasks.
Auto-Encoding Adversarial Imitation Learning
Zhang, Kaifeng, Zhao, Rui, Zhang, Ziming, Gao, Yang
Reinforcement learning (RL) provides a powerful framework for decision-making, but its application in practice often requires a carefully designed reward function. Adversarial Imitation Learning (AIL) sheds light on automatic policy acquisition without access to the reward signal from the environment. In this work, we propose Auto-Encoding Adversarial Imitation Learning (AEAIL), a robust and scalable AIL framework. To induce expert policies from demonstrations, AEAIL utilizes the reconstruction error of an auto-encoder as a reward signal, which provides more information for optimizing policies than the prior discriminator-based ones. Subsequently, we use the derived objective functions to train the auto-encoder and the agent policy. Experiments show that our AEAIL performs superior compared to state-of-the-art methods on both state and image based environments. More importantly, AEAIL shows much better robustness when the expert demonstrations are noisy.
Self-Supervised Geometric Correspondence for Category-Level 6D Object Pose Estimation in the Wild
Zhang, Kaifeng, Fu, Yang, Borse, Shubhankar, Cai, Hong, Porikli, Fatih, Wang, Xiaolong
While 6D object pose estimation has wide applications across computer vision and robotics, it remains far from being solved due to the lack of annotations. The problem becomes even more challenging when moving to category-level 6D pose, which requires generalization to unseen instances. Current approaches are restricted by leveraging annotations from simulation or collected from humans. In this paper, we overcome this barrier by introducing a self-supervised learning approach trained directly on large-scale real-world object videos for category-level 6D pose estimation in the wild. Our framework reconstructs the canonical 3D shape of an object category and learns dense correspondences between input images and the canonical shape via surface embedding. For training, we propose novel geometrical cycle-consistency losses which construct cycles across 2D-3D spaces, across different instances and different time steps. The learned correspondence can be applied for 6D pose estimation and other downstream tasks such as keypoint transfer. Surprisingly, our method, without any human annotations or simulators, can achieve on-par or even better performance than previous supervised or semisupervised methods on in-the-wild images. Code and videos are available at https://kywind.github.io/self-pose. Object 6D pose estimation is a long-standing problem for computer vision and robotics. In instancelevel 6D pose estimation, a model is trained to estimate the 6D pose for one single instance given its 3D shape template (He et al., 2020; Xiang et al., 2017; Oberweger et al., 2018). For generalizing to unseen objects and removing the requirement of 3D CAD templates, approaches for category-level 6D pose estimation are proposed (Wang et al., 2019b).