Liu, Songming
ManiBox: Enhancing Spatial Grasping Generalization via Scalable Simulation Data Generation
Tan, Hengkai, Xu, Xuezhou, Ying, Chengyang, Mao, Xinyi, Liu, Songming, Zhang, Xingxing, Su, Hang, Zhu, Jun
Learning a precise robotic grasping policy is crucial for embodied agents operating in complex real-world manipulation tasks. Despite significant advancements, most models still struggle with accurate spatial positioning of objects to be grasped. We first show that this spatial generalization challenge stems primarily from the extensive data requirements for adequate spatial understanding. However, collecting such data with real robots is prohibitively expensive, and relying on simulation data often leads to visual generalization gaps upon deployment. To overcome these challenges, we then focus on state-based policy generalization and present \textbf{ManiBox}, a novel bounding-box-guided manipulation method built on a simulation-based teacher-student framework. The teacher policy efficiently generates scalable simulation data using bounding boxes, which are proven to uniquely determine the objects' spatial positions. The student policy then utilizes these low-dimensional spatial states to enable zero-shot transfer to real robots. Through comprehensive evaluations in simulated and real-world environments, ManiBox demonstrates a marked improvement in spatial grasping generalization and adaptability to diverse objects and backgrounds. Further, our empirical study into scaling laws for policy performance indicates that spatial volume generalization scales with data volume according to a power law. For a given spatial volume, the grasping success rate empirically follows Michaelis-Menten kinetics with respect to data volume, showing a saturation effect as data increases. Our videos and code are available at https://thkkk.github.io/manibox.
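As a rough illustration of the Michaelis-Menten relationship cited above (a sketch only: the parameterization and constants below are generic placeholders, not values reported by the paper), the success rate $s$ as a function of data volume $D$ takes the saturating form
\[
s(D) = \frac{s_{\max}\, D}{K + D},
\]
where $s_{\max}$ is the asymptotic success rate and $K$ is the data volume at which half of $s_{\max}$ is reached, so success rises steeply at small $D$ and saturates as $D$ grows.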
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
Liu, Songming, Wu, Lingxuan, Li, Bangguo, Tan, Hengkai, Chen, Huayu, Wang, Zhengyi, Xu, Ke, Su, Hang, Zhu, Jun
Bimanual manipulation is essential in robotics, yet developing foundation models is extremely challenging due to the inherent complexity of coordinating two robot arms (leading to multi-modal action distributions) and the scarcity of training data. In this paper, we present the Robotics Diffusion Transformer (RDT), a pioneering diffusion foundation model for bimanual manipulation. RDT builds on diffusion models to effectively represent multi-modality, with innovative designs of a scalable Transformer to deal with the heterogeneity of multi-modal inputs and to capture the nonlinearity and high frequency of robotic data. To address data scarcity, we further introduce a Physically Interpretable Unified Action Space, which can unify the action representations of various robots while preserving the physical meanings of the original actions, facilitating the learning of transferable physical knowledge. With these designs, we managed to pre-train RDT on the largest collection of multi-robot datasets to date and scaled it up to 1.2B parameters, which is the largest diffusion-based foundation model for robotic manipulation. We finally fine-tuned RDT on a self-created multi-task bimanual dataset with over 6K episodes to refine its manipulation capabilities. Experiments on real robots demonstrate that RDT significantly outperforms existing methods. It exhibits zero-shot generalization to unseen objects and scenes, understands and follows language instructions, learns new skills with just 1 to 5 demonstrations, and effectively handles complex, dexterous tasks. We refer to the project page for the code and videos.

Bimanual manipulation is essential for robots to accomplish real-world tasks (Edsinger & Kemp, 2007). For practical applications, a useful manipulation policy should be able to generalize to unseen scenarios, such as unseen objects and scenes. Following the success in natural language processing (Achiam et al., 2023; Touvron et al., 2023) and computer vision (Radford et al., 2021; Kirillov et al., 2023), one promising direction to enable generalizable behaviors is to develop a foundation model through imitation learning on large-scale datasets. However, it is highly non-trivial to develop a bimanual manipulation foundation model. One main reason is that the accessible data for a specific dual-arm robot is significantly scarce (Sharma et al., 2018; Collaboration et al., 2023) due to high hardware costs, undermining the data-intensive requirements of training foundation models. Inspired by recent attempts in unimanual manipulation (Brohan et al., 2023; Kim et al., 2024), we seek to first pre-train on extensive multi-robot datasets and then fine-tune on a small dataset collected on the target dual-arm robot. This can help us scale up the data size by up to three orders of magnitude, with the potential to learn transferable physics knowledge from datasets of other robots. Nevertheless, there are two key technical challenges.
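To make the role of the diffusion model concrete, the following is a minimal sketch of a DDIM-style reverse process that samples an action chunk conditioned on observations, which is one standard way a diffusion policy represents multi-modal action distributions. It is only an assumed illustration: denoise_model, alphas_cumprod, and the deterministic (eta = 0) update are generic placeholders and do not reflect RDT's actual architecture, conditioning, or sampler.

    import torch

    def sample_action_chunk(denoise_model, obs, horizon, act_dim, alphas_cumprod):
        # alphas_cumprod: 1-D torch tensor of cumulative noise-schedule products.
        a = torch.randn(horizon, act_dim)                        # start from Gaussian noise
        for t in reversed(range(len(alphas_cumprod))):
            eps = denoise_model(a, obs, t)                       # predicted noise at step t
            a_bar = alphas_cumprod[t]
            a_bar_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
            a0 = (a - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()   # estimate of the clean action chunk
            a = a_bar_prev.sqrt() * a0 + (1 - a_bar_prev).sqrt() * eps  # deterministic DDIM step
        return a

Because sampling starts from noise, repeated calls with the same observation can land in different valid action modes, which is the property that motivates a diffusion head for bimanual coordination.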
Fourier Controller Networks for Real-Time Decision-Making in Embodied Learning
Tan, Hengkai, Liu, Songming, Ma, Kai, Ying, Chengyang, Zhang, Xingxing, Su, Hang, Zhu, Jun
The Transformer has shown promise in reinforcement learning for modeling time-varying features to obtain generalized low-level robot policies on diverse robotics datasets in embodied learning. However, it still suffers from low data efficiency and high inference latency. In this paper, we propose to investigate the task from a new perspective: the frequency domain. We first observe that the energy density in the frequency domain of a robot's trajectory is mainly concentrated in the low-frequency part. We then present the Fourier Controller Network (FCNet), a new network that uses the Short-Time Fourier Transform (STFT) to extract and encode time-varying features through frequency-domain interpolation. To enable real-time decision-making, we further adopt FFT and Sliding DFT methods in the model architecture to achieve parallel training and efficient recurrent inference. Extensive results in both simulated (e.g., D4RL) and real-world environments (e.g., robot locomotion) demonstrate FCNet's substantial efficiency and effectiveness over existing methods such as the Transformer; for example, FCNet outperforms the Transformer on multi-environmental robotics datasets of various sizes (from 1.9M to 120M). The project page and code can be found at https://thkkk.github.io/fcnet.
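The efficient recurrent inference mentioned above rests on the Sliding DFT, whose textbook recurrence updates all N frequency bins in O(N) per incoming sample instead of recomputing a full transform over the window. The snippet below is that standard recurrence, given only for illustration; the function name is a placeholder and this is not FCNet's actual inference code.

    import numpy as np

    def sliding_dft_update(X, x_oldest, x_newest, N):
        # X: complex array of the N current DFT bins over a length-N window.
        # x_oldest: sample leaving the window; x_newest: sample entering it.
        k = np.arange(N)
        return (X - x_oldest + x_newest) * np.exp(2j * np.pi * k / N)

Each new control step then only needs the previous spectrum and the two boundary samples, which is what makes low-latency recurrent inference feasible.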
Reference Neural Operators: Learning the Smooth Dependence of Solutions of PDEs on Geometric Deformations
Cheng, Ze, Hao, Zhongkai, Wang, Xiaoqiang, Huang, Jianing, Wu, Youjia, Liu, Xudan, Zhao, Yiru, Liu, Songming, Su, Hang
For partial differential equations on domains of arbitrary shapes, existing work on neural operators attempts to learn a mapping from geometries to solutions. This often requires a large dataset of geometry-solution pairs to obtain a sufficiently accurate neural operator. However, for many industrial applications, e.g., engineering design optimization, it can be prohibitive to satisfy this requirement, since even a single simulation may take hours or days of computation. To address this issue, we propose reference neural operators (RNO), a novel way of implementing neural operators, namely learning the smooth dependence of solutions on geometric deformations. Specifically, given a reference solution, RNO can predict solutions corresponding to arbitrary deformations of the referred geometry. This approach turns out to be much more data-efficient. Through extensive experiments, we show that RNO can learn the dependence across various types and different numbers of geometry objects with relatively small datasets. RNO outperforms baseline models in accuracy by a large margin and achieves up to 80% error reduction.
DPOT: Auto-Regressive Denoising Operator Transformer for Large-Scale PDE Pre-Training
Hao, Zhongkai, Su, Chang, Liu, Songming, Berner, Julius, Ying, Chengyang, Su, Hang, Anandkumar, Anima, Song, Jian, Zhu, Jun
Pre-training has been investigated to improve the efficiency and performance of training neural operators in data-scarce settings. However, it remains largely in its infancy due to the inherent complexity and diversity of partial differential equation (PDE) data, such as long trajectories, multiple scales, and varying dimensions. In this paper, we present a new auto-regressive denoising pre-training strategy, which allows for more stable and efficient pre-training on PDE data and generalizes to various downstream tasks. Moreover, by designing a flexible and scalable model architecture based on Fourier attention, we can easily scale up the model for large-scale pre-training. We train our PDE foundation model with up to 0.5B parameters on 10+ PDE datasets with more than 100k trajectories. Extensive experiments show that we achieve SOTA on these benchmarks and validate the strong generalizability of our model, which significantly enhances performance on diverse downstream PDE tasks such as 3D data. Code is available at \url{https://github.com/thu-ml/DPOT}.
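As a minimal sketch of what an auto-regressive denoising objective can look like, the snippet below assumes the common recipe of corrupting the current PDE state with Gaussian noise and regressing the next state; model, traj, and noise_std are hypothetical names, and DPOT's exact corruption, normalization, and loss may differ.

    import torch

    def autoregressive_denoising_loss(model, traj, noise_std=0.01):
        # traj: tensor of shape (T, ...) holding consecutive PDE states u_0, ..., u_{T-1}.
        u_t, u_next = traj[:-1], traj[1:]                    # all (input, target) pairs
        u_noisy = u_t + noise_std * torch.randn_like(u_t)    # denoising: corrupt the inputs
        pred = model(u_noisy)                                # one-step autoregressive prediction
        return torch.mean((pred - u_next) ** 2)              # next-step reconstruction error

Training against noisy inputs makes the learned operator tolerant to its own prediction errors, which is what stabilizes long autoregressive rollouts at test time.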
Task Aware Dreamer for Task Generalization in Reinforcement Learning
Ying, Chengyang, Hao, Zhongkai, Zhou, Xinning, Su, Hang, Liu, Songming, Yan, Dong, Zhu, Jun
A long-standing goal of reinforcement learning is to acquire agents that can learn on training tasks and generalize well to unseen tasks that may share similar dynamics but have different reward functions. The ability to generalize across tasks is important, as it determines an agent's adaptability to real-world scenarios where reward mechanisms might vary. In this work, we first show that training a general world model can exploit similar structures in these tasks and help train more generalizable agents. Extending world models to the task generalization setting, we introduce a novel method named Task Aware Dreamer (TAD), which integrates reward-informed features to identify consistent latent characteristics across tasks. Within TAD, we compute the variational lower bound of the sample-data log-likelihood, which introduces a new term designed to differentiate tasks via their states, as the optimization objective of our reward-informed world models. To demonstrate the advantages of the reward-informed policy in TAD, we introduce a new metric called Task Distribution Relevance (TDR), which quantitatively measures the relevance of different tasks. For tasks exhibiting high TDR, i.e., tasks that differ significantly, we show that Markovian policies struggle to distinguish them, so it is necessary to use reward-informed policies in TAD. Extensive experiments on both image-based and state-based tasks show that TAD can significantly improve performance when handling different tasks simultaneously, especially those with high TDR, and displays strong generalization to unseen tasks.
Preconditioning for Physics-Informed Neural Networks
Liu, Songming, Su, Chang, Yao, Jiachen, Hao, Zhongkai, Su, Hang, Wu, Youjia, Zhu, Jun
Physics-informed neural networks (PINNs) have shown promise in solving various partial differential equations (PDEs). However, training pathologies have negatively affected the convergence and prediction accuracy of PINNs, which further limits their practical applications. In this paper, we propose to use condition number as a metric to diagnose and mitigate the pathologies in PINNs. Inspired by classical numerical analysis, where the condition number measures sensitivity and stability, we highlight its pivotal role in the training dynamics of PINNs. We prove theorems to reveal how condition number is related to both the error control and convergence of PINNs. Subsequently, we present an algorithm that leverages preconditioning to improve the condition number. Evaluations of 18 PDE problems showcase the superior performance of our method. Significantly, in 7 of these problems, our method reduces errors by an order of magnitude. These empirical findings verify the critical role of the condition number in PINNs' training.
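For reference, the quantities invoked here are the classical ones from numerical analysis rather than anything specific to the proposed algorithm: for an invertible matrix or linear operator $A$, the condition number is
\[
\kappa(A) = \|A\|\,\|A^{-1}\|,
\]
and preconditioning replaces the system $Au = b$ with $P^{-1}Au = P^{-1}b$ for a suitably chosen $P \approx A$, so that $\kappa(P^{-1}A) \ll \kappa(A)$ and small residuals (analogous to small training losses in a PINN) translate into small solution errors.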
PINNacle: A Comprehensive Benchmark of Physics-Informed Neural Networks for Solving PDEs
Hao, Zhongkai, Yao, Jiachen, Su, Chang, Su, Hang, Wang, Ziao, Lu, Fanzhi, Xia, Zeyu, Zhang, Yichi, Liu, Songming, Lu, Lu, Zhu, Jun
While significant progress has been made on Physics-Informed Neural Networks (PINNs), a comprehensive comparison of these methods across a wide range of Partial Differential Equations (PDEs) is still lacking. This study introduces PINNacle, a benchmarking tool designed to fill this gap. PINNacle provides a diverse dataset, comprising over 20 distinct PDEs from various domains, including heat conduction, fluid dynamics, biology, and electromagnetics. These PDEs encapsulate key challenges inherent to real-world problems, such as complex geometry, multi-scale phenomena, nonlinearity, and high dimensionality. PINNacle also offers a user-friendly toolbox, incorporating about 10 state-of-the-art PINN methods for systematic evaluation and comparison. We have conducted extensive experiments with these methods, offering insights into their strengths and weaknesses. In addition to providing a standardized means of assessing performance, PINNacle also offers an in-depth analysis to guide future research, particularly in areas such as domain decomposition methods and loss reweighting for handling multi-scale problems and complex geometry. To the best of our knowledge, it is the largest benchmark with a diverse and comprehensive evaluation that will undoubtedly foster further research in PINNs.
GNOT: A General Neural Operator Transformer for Operator Learning
Hao, Zhongkai, Wang, Zhengyi, Su, Hang, Ying, Chengyang, Dong, Yinpeng, Liu, Songming, Cheng, Ze, Song, Jian, Zhu, Jun
Learning partial differential equations' (PDEs) solution operators is an essential problem in machine learning. However, there are several challenges for learning operators in practical applications like the irregular mesh, multiple input functions, and complexity of the PDEs' solution. To address these challenges, we propose a general neural operator transformer (GNOT), a scalable and effective transformer-based framework for learning operators. By designing a novel heterogeneous normalized attention layer, our model is highly flexible to handle multiple input functions and irregular meshes. Besides, we introduce a geometric gating mechanism which could be viewed as a soft domain decomposition to solve the multi-scale problems. The large model capacity of the transformer architecture grants our model the possibility to scale to large datasets and practical problems. We conduct extensive experiments on multiple challenging datasets from different domains and achieve a remarkable improvement compared with alternative methods. Our code and data are publicly available at \url{https://github.com/thu-ml/GNOT}.
MultiAdam: Parameter-wise Scale-invariant Optimizer for Multiscale Training of Physics-informed Neural Networks
Yao, Jiachen, Su, Chang, Hao, Zhongkai, Liu, Songming, Su, Hang, Zhu, Jun
Physics-informed Neural Networks (PINNs) have recently achieved remarkable progress in solving Partial Differential Equations (PDEs) in various fields by minimizing a weighted sum of PDE loss and boundary loss. However, there are several critical challenges in the training of PINNs, including the lack of theoretical frameworks and the imbalance between PDE loss and boundary loss. In this paper, we present an analysis of second-order non-homogeneous PDEs, which are classified into three categories and applicable to various common problems.

Therefore, it has attracted an increasing amount of attention to combine machine learning techniques for solving PDEs. The Physics-informed Neural Network (PINN) (Raissi et al., 2019) is one of the representative approaches, which approximates solutions by training neural networks to minimize a weighted sum of PDE loss and boundary loss -- the former is induced from the differential equations while the latter is induced from the boundary and initial conditions. PINN has shown its effectiveness in various sophisticated cases and has been applied in various fields including fluid mechanics (Raissi et al., 2020; Sun et al., 2020) and bio-engineering (Sahli Costabal et al., 2020; Kissas et al., 2020).
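For concreteness, the weighted composite objective referred to in both the abstract and the introduction is typically written in the following generic form (the particular weights, sampling, and any additional terms used by MultiAdam are not reproduced here):
\[
\mathcal{L}(\theta) = \lambda_r \,\frac{1}{N_r} \sum_{i=1}^{N_r} \big\| \mathcal{N}[u_\theta](x_i) \big\|^2 \;+\; \lambda_b \,\frac{1}{N_b} \sum_{j=1}^{N_b} \big\| \mathcal{B}[u_\theta](x_j) \big\|^2,
\]
where $\mathcal{N}$ is the differential operator evaluated at residual points $x_i$, $\mathcal{B}$ encodes the boundary and initial conditions at points $x_j$, and the weights $\lambda_r, \lambda_b$ are exactly the terms whose imbalance motivates a parameter-wise scale-invariant optimizer.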