Wang, Limin
SPA: 3D Spatial-Awareness Enables Effective Embodied Representation
Zhu, Haoyi, Yang, Honghui, Wang, Yating, Yang, Jiange, Wang, Limin, He, Tong
In this paper, we introduce SPA, a novel representation learning framework that emphasizes the importance of 3D spatial awareness in embodied AI. Our approach leverages differentiable neural rendering on multi-view images to endow a vanilla Vision Transformer (ViT) with intrinsic spatial understanding. We present the most comprehensive evaluation of embodied representation learning to date, covering 268 tasks across 8 simulators with diverse policies in both single-task and language-conditioned multi-task scenarios. The results are compelling: SPA consistently outperforms more than 10 state-of-the-art representation methods, including those specifically designed for embodied AI, vision-centric tasks, and multi-modal applications, while using less training data. Furthermore, we conduct a series of real-world experiments to confirm its effectiveness in practical scenarios. These results highlight the critical role of 3D spatial awareness for embodied representation learning. Our strongest model takes more than 6000 GPU hours to train and we are committed to open-sourcing all code and model weights to foster future research in embodied representation learning. Project Page: https://haoyizhu.github.io/spa/.
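As a rough illustration of the differentiable neural rendering supervision mentioned in the abstract, the sketch below shows standard volumetric alpha compositing of per-ray samples in PyTorch. It is a minimal sketch only: SPA's actual rendering head, ray sampling scheme, and loss terms follow the paper and released code, and the tensor names and shapes here are assumptions.

```python
# Minimal sketch of differentiable volume rendering used as a training signal
# (illustrative only; not SPA's exact rendering head or loss formulation).
import torch

def composite_rays(densities, colors, deltas):
    """Alpha-composite per-ray samples into pixel colors.

    densities: (R, S)    non-negative volume densities per sample
    colors:    (R, S, 3) per-sample RGB decoded from ViT features (assumed)
    deltas:    (R, S)    distances between consecutive samples along each ray
    """
    alphas = 1.0 - torch.exp(-densities * deltas)                    # (R, S)
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=-1)              # transmittance
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    weights = alphas * trans                                         # (R, S)
    return (weights.unsqueeze(-1) * colors).sum(dim=-2)              # (R, 3)

# A photometric loss between rendered and ground-truth multi-view pixels would
# then back-propagate into the ViT encoder:
# loss = torch.nn.functional.mse_loss(composite_rays(sigma, rgb, deltas), rgb_gt)
```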
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Li, Qingyun, Chen, Zhe, Wang, Weiyun, Wang, Wenhai, Ye, Shenglong, Jin, Zhenjiang, Chen, Guanzhou, He, Yinan, Gao, Zhangwei, Cui, Erfei, Yu, Jiashuo, Tian, Hao, Zhou, Jiasheng, Xu, Chao, Wang, Bin, Wei, Xingjian, Li, Wei, Zhang, Wenjian, Zhang, Bo, Cai, Pinlong, Wen, Licheng, Yan, Xiangchao, Li, Zhenxiang, Chu, Pei, Wang, Yi, Dou, Min, Tian, Changyao, Zhu, Xizhou, Lu, Lewei, Chen, Yushi, He, Junjun, Tu, Zhongying, Lu, Tong, Wang, Yali, Wang, Limin, Lin, Dahua, Qiao, Yu, Shi, Botian, He, Conghui, Dai, Jifeng
Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning. However, the limited scale and diversity of current image-text interleaved data restrict the development of multimodal large language models. In this paper, we introduce OmniCorpus, a 10 billion-level image-text interleaved dataset. Using an efficient data engine, we filter and extract large-scale, high-quality documents containing 8.6 billion images and 1,696 billion text tokens. Compared to counterparts (e.g., MMC4, OBELICS), our dataset 1) is 15 times larger in scale while maintaining good data quality; 2) features more diverse sources, including both English and non-English websites as well as video-centric websites; 3) is more flexible, as it can easily be reduced from an image-text interleaved format to a pure text corpus or image-text pairs. Through comprehensive analysis and experiments, we validate the quality, usability, and effectiveness of the proposed dataset. We hope it can provide a solid data foundation for future multimodal model research.
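The flexibility claim above amounts to a simple transformation over interleaved documents. The sketch below illustrates the idea on a hypothetical document representation (a list of text and image items); the field names are assumptions and not OmniCorpus's actual schema.

```python
# Illustrative reduction of an interleaved document (hypothetical schema:
# [{"type": "text", "text": ...}, {"type": "image", "image": ...}, ...])
# into a pure-text corpus entry or image-text pairs.
def to_pure_text(doc):
    """Concatenate the text items of one interleaved document."""
    return " ".join(item["text"] for item in doc if item["type"] == "text")

def to_image_text_pairs(doc):
    """Pair each image with a neighboring text item, if any."""
    pairs = []
    for i, item in enumerate(doc):
        if item["type"] != "image":
            continue
        # prefer the following text item as a caption, else the preceding one
        for j in (i + 1, i - 1):
            if 0 <= j < len(doc) and doc[j]["type"] == "text":
                pairs.append((item["image"], doc[j]["text"]))
                break
    return pairs
```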
VFIMamba: Video Frame Interpolation with State Space Models
Zhang, Guozhen, Liu, Chunxu, Cui, Yutao, Zhao, Xiaotong, Ma, Kai, Wang, Limin
Inter-frame modeling is pivotal in generating intermediate frames for video frame interpolation (VFI). Current approaches predominantly rely on convolution- or attention-based models, which often either lack sufficient receptive fields or entail significant computational overhead. Recently, Selective State Space Models (S6) have emerged, tailored specifically for long-sequence modeling and offering both linear complexity and data-dependent modeling capabilities. In this paper, we propose VFIMamba, a novel frame interpolation method for efficient and dynamic inter-frame modeling that harnesses the S6 model. Our approach introduces the Mixed-SSM Block (MSB), which first rearranges tokens from adjacent frames in an interleaved fashion and then applies multi-directional S6 modeling. This design facilitates the efficient transmission of information across frames while upholding linear complexity. Furthermore, we introduce a novel curriculum learning strategy that progressively cultivates proficiency in modeling inter-frame dynamics across varying motion magnitudes, fully unleashing the potential of the S6 model. Experiments show that our method attains state-of-the-art performance across diverse benchmarks, particularly excelling in high-resolution scenarios. Notably, on the X-TEST dataset, VFIMamba achieves improvements of 0.80 dB for 4K frames and 0.96 dB for 2K frames.
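The token rearrangement inside the Mixed-SSM Block can be pictured as a simple interleaving of the two frames' token sequences before the selective scan; the sketch below shows only this rearrangement step and omits the S6 scanning itself, with shapes assumed for illustration.

```python
# Sketch of interleaved token rearrangement across two adjacent frames
# (the multi-directional S6 modeling that follows is not shown).
import torch

def interleave_frame_tokens(x0, x1):
    """x0, x1: (B, N, C) token sequences of adjacent frames.
    Returns (B, 2N, C) ordered as x0[0], x1[0], x0[1], x1[1], ...
    """
    B, N, C = x0.shape
    mixed = torch.stack([x0, x1], dim=2)   # (B, N, 2, C)
    return mixed.reshape(B, 2 * N, C)
```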
Spatiotemporal Predictive Pre-training for Robotic Motor Control
Yang, Jiange, Liu, Bei, Fu, Jianlong, Pan, Bocheng, Wu, Gangshan, Wang, Limin
Robotic motor control necessitates the ability to predict the dynamics of environments and interaction objects. However, advanced self-supervised pre-trained visual representations (PVRs) for robotic motor control, learned from large-scale egocentric videos, often focus solely on the static content features of sampled image frames. This neglects the crucial temporal motion cues in human video data, which implicitly contain key knowledge about sequentially interacting with and manipulating environments and objects. In this paper, we present STP, a simple yet effective visual pre-training framework for robotic motor control that jointly performs spatiotemporal prediction with dual decoders on large-scale video data. STP adheres to two key designs in a multi-task learning manner. First, we perform spatial prediction on the masked current frame to learn content features. Second, conditioned on the masked current frame, we use the future frame with an extremely high masking ratio to conduct temporal prediction of the future frame and capture motion features. This asymmetric masking and decoder design is highly efficient and ensures that our representation focuses on motion information while still capturing spatial details. We carry out the largest-scale behavior cloning (BC) evaluation of PVRs for robotic motor control to date, encompassing 21 tasks on a real-world Franka robot arm and in 5 simulated environments. Extensive experiments demonstrate the effectiveness of STP and further unleash its generality and data efficiency through post-pre-training and hybrid pre-training. Our code and weights will be released for further applications.
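The asymmetric masking described above can be sketched as two random masking operations with very different ratios: a moderate ratio for the current frame (spatial reconstruction) and an extremely high ratio for the future frame (temporal prediction). The ratios and shapes below are illustrative assumptions, not the paper's exact settings.

```python
# Sketch of asymmetric random masking for STP-style spatiotemporal prediction.
import torch

def random_keep_indices(num_tokens, mask_ratio, batch_size):
    """Return indices of tokens to KEEP for each sample (uniform random masking)."""
    num_keep = int(num_tokens * (1.0 - mask_ratio))
    noise = torch.rand(batch_size, num_tokens)
    return noise.argsort(dim=1)[:, :num_keep]          # (B, num_keep)

# Current frame: moderately masked; future frame: almost fully masked
# (ratios are assumptions for illustration).
keep_cur = random_keep_indices(num_tokens=196, mask_ratio=0.75, batch_size=8)
keep_fut = random_keep_indices(num_tokens=196, mask_ratio=0.95, batch_size=8)
# The encoder sees only the kept current-frame tokens; one decoder reconstructs
# the masked current frame, the other predicts the future frame.
```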
VideoChat: Chat-Centric Video Understanding
Li, KunChang, He, Yinan, Wang, Yi, Li, Yizhuo, Wang, Wenhai, Luo, Ping, Wang, Yali, Wang, Limin, Qiao, Yu
In this paper, we present an initial attempt at developing an end-to-end chat-centric video understanding system, coined VideoChat. It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference. To instruction-tune this system, we build a video-centric instruction dataset composed of thousands of videos associated with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and captures causal relationships, providing a valuable asset for training our chat-centric video understanding system. Preliminary qualitative experiments demonstrate the potential of our system across a broad spectrum of video applications, and it could serve as a simple prototype system for future research on chat-centric video understanding. Access our code and data at https://github.com/OpenGVLab/Ask-Anything
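One way to picture a "learnable neural interface" between a frozen video encoder and a frozen LLM is a small trainable projection that maps video embeddings into the LLM's token-embedding space before they are prepended to the text prompt. The sketch below illustrates this general idea; it is not VideoChat's exact architecture, and all module names and dimensions are assumptions.

```python
# Minimal sketch of a learnable bridge from video features to LLM inputs.
import torch.nn as nn

class VideoToLLMInterface(nn.Module):
    def __init__(self, video_dim, llm_dim):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(video_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, video_feats):       # (B, T, video_dim) from a frozen encoder
        return self.proj(video_feats)     # (B, T, llm_dim), prepended to text tokens
```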
BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models
Shi, Fengyuan, Gu, Jiaxi, Xu, Hang, Xu, Songcen, Zhang, Wei, Wang, Limin
Diffusion models have made tremendous progress in text-driven image and video generation. Text-to-image foundation models are now widely applied to various downstream image synthesis tasks, such as controllable image generation and image editing, whereas downstream video synthesis tasks are less explored for several reasons. First, training a video generation foundation model requires huge memory and computation overhead. Even with video foundation models, additional costly training is still required for downstream video synthesis tasks. Second, although some works extend image diffusion models to videos in a training-free manner, temporal consistency cannot be well maintained. Finally, these adaptation methods are specifically designed for one task and fail to generalize to different downstream video synthesis tasks. To mitigate these issues, we propose BIVDiff, a training-free, general-purpose video synthesis framework that bridges specific image diffusion models and general text-to-video foundation diffusion models. Specifically, we first use an image diffusion model (e.g., ControlNet, InstructPix2Pix) for frame-wise video generation, then perform Mixed Inversion on the generated video, and finally feed the inverted latents into the video diffusion model for temporal smoothing. Decoupling the image and video models enables flexible image model selection for different purposes, endowing the framework with strong task generalization and high efficiency. To validate the effectiveness and general applicability of BIVDiff, we perform a wide range of video generation tasks, including controllable video generation, video editing, video inpainting, and outpainting. Our project page is available at https://bivdiff.github.io.
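The three-stage pipeline described above (frame-wise image diffusion, Mixed Inversion, video-level temporal smoothing) can be summarized in a few lines of pseudocode. The wrappers image_model, video_model, and ddim_invert below are hypothetical placeholders rather than real library APIs, and the mixing rule is a plausible reading of "Mixed Inversion", not the paper's exact formulation.

```python
# High-level sketch of a BIVDiff-style pipeline with hypothetical model wrappers.
import torch

def bivdiff_sketch(frames, prompt, image_model, video_model, ddim_invert, mix_ratio=0.5):
    # 1) Frame-wise synthesis with a task-specific image diffusion model.
    edited = torch.stack([image_model(f, prompt) for f in frames])       # (T, C, H, W)

    # 2) Mixed Inversion: blend inverted latents with random noise to trade off
    #    fidelity to the frame-wise result against temporal flexibility (assumed form).
    inverted = ddim_invert(edited)                                       # (T, C', h, w)
    latents = mix_ratio * inverted + (1 - mix_ratio) * torch.randn_like(inverted)

    # 3) Temporal smoothing with a general text-to-video diffusion model.
    return video_model(latents, prompt)
```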
Bridging The Gaps Between Token Pruning and Full Pre-training via Masked Fine-tuning
Shi, Fengyuan, Wang, Limin
Despite the success of transformers on various computer vision tasks, they suffer from excessive memory and computational cost. Some works present dynamic vision transformers that accelerate inference by pruning redundant tokens. A key to improving token pruning is using well-trained models as initialization for faster convergence and better performance. However, current base models usually adopt full-image training, i.e., using full images as inputs and keeping the whole feature maps throughout the forward process, which causes inconsistencies with dynamic models that gradually reduce tokens, including inconsistencies in calculation pattern, information amount, and token selection strategy. Inspired by MAE, which performs a self-supervised masking-and-reconstruction task, we devise masked fine-tuning to bridge the gaps between pre-trained base models used for initialization and token-pruning-based dynamic vision transformers, by masking image patches and predicting the image class label from the remaining unmasked patches. Extensive experiments on ImageNet demonstrate that base models trained with masked fine-tuning gain strong robustness to occlusion and information loss. With this better initialization, Dynamic ViT achieves higher accuracies, especially under large token pruning ratios (e.g., 81.9% vs. 81.3%, and 62.3% vs. 58.9% for DeiT-based Dynamic ViT/0.8 and Dynamic ViT/0.3). Moreover, we apply our method to different token-pruning-based dynamic vision transformers, different pre-trained models, and randomly initialized models to demonstrate its generalization ability.
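The core training step of masked fine-tuning, as described above, is to drop a fraction of patch tokens and classify the image from the remaining ones. The sketch below shows one such step under a simplified, assumed ViT interface (patch_embed and classify are placeholders, not a specific library's API).

```python
# Sketch of one masked fine-tuning step: classify from randomly kept patches.
import torch
import torch.nn.functional as F

def masked_finetune_step(vit, images, labels, mask_ratio=0.5):
    tokens = vit.patch_embed(images)                       # (B, N, C), assumed interface
    B, N, C = tokens.shape
    num_keep = int(N * (1.0 - mask_ratio))
    keep = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :num_keep]
    tokens = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, C))
    logits = vit.classify(tokens)                          # class prediction from unmasked patches
    return F.cross_entropy(logits, labels)
```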
Transferring Foundation Models for Generalizable Robotic Manipulation
Yang, Jiange, Tan, Wenhui, Jin, Chuhao, Yao, Keling, Liu, Bei, Fu, Jianlong, Song, Ruihua, Wu, Gangshan, Wang, Limin
Improving the generalization capabilities of general-purpose robotic manipulation agents in the real world has long been a significant challenge. Existing approaches often rely on collecting large-scale robotic data, such as the RT-1 dataset, which is costly and time-consuming. However, due to insufficient data diversity, these approaches are typically limited in open-domain scenarios with new objects and diverse environments. In this paper, we propose a novel paradigm that effectively leverages language-grounded segmentation masks generated by Internet-scale foundation models to address a wide range of pick-and-place robot manipulation tasks. By integrating the mask modality, which incorporates semantic, geometric, and temporal correlation priors derived from vision foundation models, into the end-to-end policy model, our approach can effectively and robustly perceive object pose and enable sample-efficient generalization, including to new object instances, semantic categories, and unseen backgrounds. We first introduce a series of foundation models to ground natural language demands across multiple tasks. Second, we develop a two-stream 2D policy model based on imitation learning, which utilizes raw images, object masks, and robot proprioception to predict robot actions. Extensive real-world experiments conducted on a Franka Emika robot arm demonstrate the effectiveness of our proposed paradigm. Demos are shown on YouTube (https://www.youtube.com/watch?v=MAcUPFBfRIw).
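A two-stream policy of the kind described above can be sketched as two small visual encoders (raw image and object mask) whose features are fused with proprioception before an action head. The encoders, dimensions, and action parameterization below are illustrative assumptions, not the paper's architecture.

```python
# Sketch of a two-stream (image + mask) behavior-cloning policy.
import torch
import torch.nn as nn

class TwoStreamPolicy(nn.Module):
    def __init__(self, feat_dim=256, proprio_dim=8, action_dim=7):
        super().__init__()
        self.rgb_encoder = nn.Sequential(nn.Conv2d(3, 32, 8, 4), nn.ReLU(),
                                         nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                         nn.Linear(32, feat_dim))
        self.mask_encoder = nn.Sequential(nn.Conv2d(1, 16, 8, 4), nn.ReLU(),
                                          nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                          nn.Linear(16, feat_dim))
        self.head = nn.Sequential(nn.Linear(2 * feat_dim + proprio_dim, 256),
                                  nn.ReLU(), nn.Linear(256, action_dim))

    def forward(self, rgb, mask, proprio):
        z = torch.cat([self.rgb_encoder(rgb), self.mask_encoder(mask), proprio], dim=-1)
        return self.head(z)   # actions supervised by demonstrations (imitation learning)
```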
MGMAE: Motion Guided Masking for Video Masked Autoencoding
Huang, Bingkun, Zhao, Zhiyu, Zhang, Guozhen, Qiao, Yu, Wang, Limin
Masked autoencoding has shown excellent performance on self-supervised video representation learning. Temporal redundancy has led to a high masking ratio and a customized masking strategy in VideoMAE. In this paper, we aim to further improve the performance of video masked autoencoding by introducing a motion-guided masking strategy. Our key insight is that motion is a general and unique prior of video, which should be taken into account during masked pre-training. Our motion-guided masking explicitly incorporates motion information to build a temporally consistent masking volume. Based on this masking volume, we can track the unmasked tokens through time and sample a set of temporally consistent cubes from videos. These temporally aligned unmasked tokens further relieve the issue of information leakage in time and encourage MGMAE to learn more useful structural information. We implement MGMAE with an efficient online optical flow estimator and a backward masking-map warping strategy. We perform experiments on Something-Something V2 and Kinetics-400, demonstrating the superior performance of MGMAE over the original VideoMAE. In addition, we provide visualization analysis to illustrate that MGMAE can sample temporally consistent cubes in a motion-adaptive manner for more effective video pre-training.
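The backward masking-map warping mentioned above can be illustrated as sampling a masking map along optical-flow displacements so that masked regions stay aligned across frames. The sketch below is a simplified version of this idea; MGMAE's full implementation (online flow estimator, cube sampling) is not reproduced here, and the flow convention is an assumption.

```python
# Sketch of backward warping a masking map with optical flow (simplified).
import torch
import torch.nn.functional as F

def backward_warp_mask(mask, flow):
    """mask: (B, 1, H, W) masking map of the next frame.
    flow:  (B, 2, H, W) flow from current to next frame, channels (dx, dy) in pixels.
    Returns the masking map aligned to the current frame.
    """
    B, _, H, W = mask.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).float().to(mask.device)     # (H, W, 2) as (x, y)
    coords = grid.unsqueeze(0) + flow.permute(0, 2, 3, 1)            # sampling positions
    # normalize to [-1, 1] for grid_sample
    coords[..., 0] = 2.0 * coords[..., 0] / (W - 1) - 1.0
    coords[..., 1] = 2.0 * coords[..., 1] / (H - 1) - 1.0
    return F.grid_sample(mask, coords, mode="nearest", align_corners=True)
```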
On Data Imbalance in Molecular Property Prediction with Pre-training
Wang, Limin, Hanai, Masatoshi, Suzumura, Toyotaro, Takashige, Shun, Taura, Kenjiro
Revealing and analyzing the various properties of materials is an essential and critical issue in the development of materials, including batteries, semiconductors, catalysts, and pharmaceuticals. Traditionally, these properties have been determined through theoretical calculations and simulations. However, it is not practical to perform such calculations on every single candidate material. Recently, a method combining theoretical calculation and machine learning has emerged, which involves training machine learning models on a subset of theoretical calculation results to construct a surrogate model that can be applied to the remaining materials. Separately, a technique called pre-training is used to improve the accuracy of machine learning models. Pre-training involves training the model on a pretext task, which is different from the target task, before training it on the target task. This process aims to extract features of the input data, stabilizing the learning process and improving its accuracy. However, in the case of molecular property prediction, there is a strong imbalance in the distribution of input data and features, which may lead to learning biased towards frequently occurring data during pre-training. In this study, we propose an effective pre-training method that addresses the imbalance in input data. We aim to improve the final accuracy by modifying the loss function of node masking, a representative pre-training method, to compensate for the imbalance. We investigate and assess the impact of the proposed imbalance compensation on pre-training and final prediction accuracy through experiments and evaluations on molecular property prediction benchmarks.
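One common way to compensate for such imbalance in a node-masking pretext task is to reweight the classification loss by inverse label frequency, so that rare atom types contribute more to the gradient. The sketch below illustrates this general approach; the exact weighting scheme used in the paper may differ, and the function and argument names are assumptions.

```python
# Illustrative inverse-frequency reweighting of a node-masking (atom-type) loss.
import torch
import torch.nn.functional as F

def weighted_node_masking_loss(logits, target_atom_types, num_atom_types):
    """logits: (N_masked, num_atom_types); target_atom_types: (N_masked,) int labels."""
    counts = torch.bincount(target_atom_types, minlength=num_atom_types).float()
    weights = 1.0 / counts.clamp(min=1.0)                    # inverse-frequency weights
    weights = weights / weights.sum() * num_atom_types       # keep average weight near 1
    return F.cross_entropy(logits, target_atom_types, weight=weights)
```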