Chen, Yuntao
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
Hou, Zhi, Zhang, Tianyi, Xiong, Yuwen, Duan, Haonan, Pu, Hengjun, Tong, Ronglei, Zhao, Chengyang, Zhu, Xizhou, Qiao, Yu, Dai, Jifeng, Chen, Yuntao
While recent vision-language-action models trained on diverse robot datasets exhibit promising generalization capabilities with limited in-domain data, their reliance on compact action heads to predict discretized or continuous actions constrains adaptability to heterogeneous action spaces. We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences through a unified multimodal diffusion process. Departing from prior methods that condition denoising on fused embeddings via shallow networks, Dita employs in-context conditioning -- enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. This design explicitly models action deltas and environmental nuances. By scaling the diffusion action denoiser alongside the Transformer's scalability, Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces. Such synergy enhances robustness against various variances and facilitates the successful execution of long-horizon tasks. Evaluations across extensive benchmarks demonstrate state-of-the-art or comparative performance in simulation. Notably, Dita achieves robust real-world adaptation to environmental variances and complex long-horizon tasks through 10-shot finetuning, using only third-person camera inputs. The architecture establishes a versatile, lightweight and open-source baseline for generalist robot policy learning. Project Page: https://robodita.github.io.
Diffusion Transformer Policy
Hou, Zhi, Zhang, Tianyi, Xiong, Yuwen, Pu, Hengjun, Zhao, Chengyang, Tong, Ronglei, Qiao, Yu, Dai, Jifeng, Chen, Yuntao
Recent large visual-language action models pretrained on diverse robot datasets have demonstrated the potential for generalizing to new environments with a few in-domain data. However, those approaches usually predict discretized or continuous actions by a small action head, which limits the ability in handling diverse action spaces. In contrast, we model the continuous action with a large multi-modal diffusion transformer, dubbed as Diffusion Transformer Policy, in which we directly denoise action chunks by a large transformer model rather than a small action head. By leveraging the scaling capability of transformers, the proposed approach can effectively model continuous end-effector actions across large diverse robot datasets, and achieve better generalization performance. Extensive experiments demonstrate Diffusion Transformer Policy pretrained on diverse robot data can generalize to different embodiments, including simulation environments like Maniskill2 and Calvin, as well as the real-world Franka arm. Specifically, without bells and whistles, the proposed approach achieves state-ofthe-art performance with only a single third-view camera stream in the Calvin novel task setting (ABC D), improving the average number of tasks completed in a row of 5 to 3.6, and the pretraining stage significantly facilitates the success sequence length on the Calvin by over 1.2. The code will be publicly available. Traditional robot learning paradigm usually relies on large-scale data collected for a specific robot and task, but collecting robot data for generalist tasks is time-consuming and expensive due to the limitations of robot hardware in the real world. Nowadays, the foundational models OpenAI (2022; 2023; 2021); Rombach et al. (2021) in Natural Language Process and Computer Vision, pretrained on broad, diverse, task-agnostic datasets, have demonstrated powerful ability in solving downstream tasks either zero-shot or with a few task-specific samples. It is principally possible that a general robot policy exposed to large scale diverse robot datasets improves generalization and performance on downstream tasks Brohan et al. (2022; 2023).
DrivingDojo Dataset: Advancing Interactive and Knowledge-Enriched Driving World Model
Wang, Yuqi, Cheng, Ke, He, Jiawei, Wang, Qitai, Dai, Hengchen, Chen, Yuntao, Xia, Fei, Zhang, Zhaoxiang
Driving world models have gained increasing attention due to their ability to model complex physical dynamics. However, their superb modeling capability is yet to be fully unleashed due to the limited video diversity in current driving datasets. We introduce DrivingDojo, the first dataset tailor-made for training interactive world models with complex driving dynamics. Our dataset features video clips with a complete set of driving maneuvers, diverse multi-agent interplay, and rich open-world driving knowledge, laying a stepping stone for future world model development. We further define an action instruction following (AIF) benchmark for world models and demonstrate the superiority of the proposed dataset for generating action-controlled future predictions.
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
Tian, Changyao, Zhu, Xizhou, Xiong, Yuwen, Wang, Weiyun, Chen, Zhe, Wang, Wenhai, Chen, Yuntao, Lu, Lewei, Lu, Tong, Zhou, Jie, Li, Hongsheng, Qiao, Yu, Dai, Jifeng
Developing generative models for interleaved image-text data has both research and practical value. It requires models to understand the interleaved sequences and subsequently generate images and text. However, existing attempts are limited by the issue that the fixed number of visual tokens cannot efficiently capture image details, which is particularly problematic in the multi-image scenarios. To address this, this paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data. It introduces a multi-scale and multi-image feature synchronizer module, allowing direct access to fine-grained image features in the previous context during the generation process. MM-Interleaved is end-to-end pre-trained on both paired and interleaved image-text corpora. It is further enhanced through a supervised fine-tuning phase, wherein the model improves its ability to follow complex multi-modal instructions. Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details following multi-modal instructions and generating consistent images following both textual and visual conditions. Code and models are available at \url{https://github.com/OpenGVLab/MM-Interleaved}.
SheetCopilot: Bringing Software Productivity to the Next Level through Large Language Models
Li, Hongxin, Su, Jingran, Chen, Yuntao, Li, Qing, Zhang, Zhaoxiang
Computer end users have spent billions of hours completing daily tasks like tabular data processing and project timeline scheduling. Most of these tasks are repetitive and error-prone, yet most end users lack the skill to automate these burdensome works. With the advent of large language models (LLMs), directing software with natural language user requests become a reachable goal. In this work, we propose a SheetCopilot agent that takes natural language task and control spreadsheet to fulfill the requirements. We propose a set of atomic actions as an abstraction of spreadsheet software functionalities. We further design a state machine-based task planning framework for LLMs to robustly interact with spreadsheets. We curate a representative dataset containing 221 spreadsheet control tasks and establish a fully automated evaluation pipeline for rigorously benchmarking the ability of LLMs in software control tasks. Our SheetCopilot correctly completes 44.3\% of tasks for a single generation, outperforming the strong code generation baseline by a wide margin. Our project page:https://sheetcopilot.github.io/.
PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation
Wang, Yuqi, Chen, Yuntao, Liao, Xingyu, Fan, Lue, Zhang, Zhaoxiang
Comprehensive modeling of the surrounding 3D world is key to the success of autonomous driving. However, existing perception tasks like object detection, road structure segmentation, depth & elevation estimation, and open-set object localization each only focus on a small facet of the holistic 3D scene understanding task. This divide-and-conquer strategy simplifies the algorithm development procedure at the cost of losing an end-to-end unified solution to the problem. In this work, we address this limitation by studying camera-based 3D panoptic segmentation, aiming to achieve a unified occupancy representation for camera-only 3D scene understanding. To achieve this, we introduce a novel method called PanoOcc, which utilizes voxel queries to aggregate spatiotemporal information from multi-frame and multi-view images in a coarse-to-fine scheme, integrating feature learning and scene representation into a unified occupancy representation. We have conducted extensive ablation studies to verify the effectiveness and efficiency of the proposed method. Our approach achieves new state-of-the-art results for camera-based semantic segmentation and panoptic segmentation on the nuScenes dataset. Furthermore, our method can be easily extended to dense occupancy prediction and has shown promising performance on the Occ3D benchmark. The code will be released at https://github.com/Robertwyq/PanoOcc.
Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory
Zhu, Xizhou, Chen, Yuntao, Tian, Hao, Tao, Chenxin, Su, Weijie, Yang, Chenyu, Huang, Gao, Li, Bin, Lu, Lewei, Wang, Xiaogang, Qiao, Yu, Zhang, Zhaoxiang, Dai, Jifeng
The captivating realm of Minecraft has attracted substantial research interest in recent years, serving as a rich platform for developing intelligent agents capable of functioning in open-world environments. However, the current research landscape predominantly focuses on specific objectives, such as the popular "ObtainDiamond" task, and has not yet shown effective generalization to a broader spectrum of tasks. Furthermore, the current leading success rate for the "ObtainDiamond" task stands at around 20%, highlighting the limitations of Reinforcement Learning (RL) based controllers used in existing methods. To tackle these challenges, we introduce Ghost in the Minecraft (GITM), a novel framework integrates Large Language Models (LLMs) with text-based knowledge and memory, aiming to create Generally Capable Agents (GCAs) in Minecraft. These agents, equipped with the logic and common sense capabilities of LLMs, can skillfully navigate complex, sparse-reward environments with text-based interactions. We develop a set of structured actions and leverage LLMs to generate action plans for the agents to execute. The resulting LLM-based agent markedly surpasses previous methods, achieving a remarkable improvement of +47.5% in success rate on the "ObtainDiamond" task, demonstrating superior robustness compared to traditional RL-based controllers. Notably, our agent is the first to procure all items in the Minecraft Overworld technology tree, demonstrating its extensive capabilities. GITM does not need any GPU for training, but a single CPU node with 32 CPU cores is enough. This research shows the potential of LLMs in developing capable agents for handling long-horizon, complex tasks and adapting to uncertainties in open-world environments. See the project website at https://github.com/OpenGVLab/GITM.
Once Detected, Never Lost: Surpassing Human Performance in Offline LiDAR based 3D Object Detection
Fan, Lue, Yang, Yuxue, Mao, Yiming, Wang, Feng, Chen, Yuntao, Wang, Naiyan, Zhang, Zhaoxiang
This paper aims for high-performance offline LiDAR-based 3D object detection. We first observe that experienced human annotators annotate objects from a track-centric perspective. They first label the objects with clear shapes in a track, and then leverage the temporal coherence to infer the annotations of obscure objects. Drawing inspiration from this, we propose a high-performance offline detector in a track-centric perspective instead of the conventional object-centric perspective. Our method features a bidirectional tracking module and a track-centric learning module. Such a design allows our detector to infer and refine a complete track once the object is detected at a certain moment. We refer to this characteristic as "onCe detecTed, neveR Lost" and name the proposed system CTRL. Extensive experiments demonstrate the remarkable performance of our method, surpassing the human-level annotating accuracy and the previous state-of-the-art methods in the highly competitive Waymo Open Dataset without model ensemble. The code will be made publicly available at https://github.com/tusen-ai/SST.
DarkRank: Accelerating Deep Metric Learning via Cross Sample Similarities Transfer
Chen, Yuntao (Institute of Automation, Chinese Academy of Sciences, University of Chinese Academy of Sciences) | Wang, Naiyan (TuSimple) | Zhang, Zhaoxiang (Institute of Automation, Chinese Academy of Sciences, University of Chinese Academy of Sciences)
We have witnessed rapid evolution of deep neural network architecture design in the past years. These latest progresses greatly facilitate the developments in various areas such as computer vision and natural language processing. However, along with the extraordinary performance, these state-of-the-art models also bring in expensive computational cost. Directly deploying these models into applications with real-time requirement is still infeasible. Recently, Hinton et al. have shown that the dark knowledge within a powerful teacher model can significantly help the training of a smaller and faster student network. These knowledge are vastly beneficial to improve the generalization ability of the student model. Inspired by their work, we introduce a new type of knowledge---cross sample similarities for model compression and acceleration. This knowledge can be naturally derived from deep metric learning model. To transfer them, we bring the "learning to rank" technique into deep metric learning formulation. We test our proposed DarkRank method on various metric learning tasks including pedestrian re-identification, image retrieval and image clustering. The results are quite encouraging. Our method can improve over the baseline method by a large margin. Moreover, it is fully compatible with other existing methods. When combined, the performance can be further boosted.