Xu, Chenfeng
StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation
Kodaira, Akio, Xu, Chenfeng, Hazama, Toshiki, Yoshimoto, Takanori, Ohno, Kohei, Mitsuhori, Shogo, Sugano, Soichi, Cho, Hanying, Liu, Zhijian, Keutzer, Kurt
We introduce StreamDiffusion, a real-time diffusion pipeline designed for interactive image generation. Existing diffusion models are adept at creating images from text or image prompts, yet they often fall short in real-time interaction. This limitation becomes particularly evident in scenarios involving continuous input, such as Metaverse, live video streaming, and broadcasting, where high throughput is imperative. To address this, we present a novel approach that transforms the original sequential denoising into the batching denoising process. Stream Batch eliminates the conventional wait-and-interact approach and enables fluid and high throughput streams. To handle the frequency disparity between data input and model throughput, we design a novel input-output queue for parallelizing the streaming process. Moreover, the existing diffusion pipeline uses classifier-free guidance(CFG), which requires additional U-Net computation. To mitigate the redundant computations, we propose a novel residual classifier-free guidance (RCFG) algorithm that reduces the number of negative conditional denoising steps to only one or even zero. Besides, we introduce a stochastic similarity filter(SSF) to optimize power consumption. Our Stream Batch achieves around 1.5x speedup compared to the sequential denoising method at different denoising levels. The proposed RCFG leads to speeds up to 2.05x higher than the conventional CFG. Combining the proposed strategies and existing mature acceleration tools makes the image-to-image generation achieve up-to 91.07fps on one RTX4090, improving the throughputs of AutoPipline developed by Diffusers over 59.56x. Furthermore, our proposed StreamDiffusion also significantly reduces the energy consumption by 2.39x on one RTX3060 and 1.99x on one RTX4090, respectively.
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Collaboration, Open X-Embodiment, Padalkar, Abhishek, Pooley, Acorn, Mandlekar, Ajay, Jain, Ajinkya, Tung, Albert, Bewley, Alex, Herzog, Alex, Irpan, Alex, Khazatsky, Alexander, Rai, Anant, Singh, Anikait, Garg, Animesh, Brohan, Anthony, Raffin, Antonin, Wahid, Ayzaan, Burgess-Limerick, Ben, Kim, Beomjoon, Schölkopf, Bernhard, Ichter, Brian, Lu, Cewu, Xu, Charles, Finn, Chelsea, Xu, Chenfeng, Chi, Cheng, Huang, Chenguang, Chan, Christine, Pan, Chuer, Fu, Chuyuan, Devin, Coline, Driess, Danny, Pathak, Deepak, Shah, Dhruv, Büchler, Dieter, Kalashnikov, Dmitry, Sadigh, Dorsa, Johns, Edward, Ceola, Federico, Xia, Fei, Stulp, Freek, Zhou, Gaoyue, Sukhatme, Gaurav S., Salhotra, Gautam, Yan, Ge, Schiavi, Giulio, Kahn, Gregory, Su, Hao, Fang, Hao-Shu, Shi, Haochen, Amor, Heni Ben, Christensen, Henrik I, Furuta, Hiroki, Walke, Homer, Fang, Hongjie, Mordatch, Igor, Radosavovic, Ilija, Leal, Isabel, Liang, Jacky, Abou-Chakra, Jad, Kim, Jaehyung, Peters, Jan, Schneider, Jan, Hsu, Jasmine, Bohg, Jeannette, Bingham, Jeffrey, Wu, Jiajun, Wu, Jialin, Luo, Jianlan, Gu, Jiayuan, Tan, Jie, Oh, Jihoon, Malik, Jitendra, Booher, Jonathan, Tompson, Jonathan, Yang, Jonathan, Lim, Joseph J., Silvério, João, Han, Junhyek, Rao, Kanishka, Pertsch, Karl, Hausman, Karol, Go, Keegan, Gopalakrishnan, Keerthana, Goldberg, Ken, Byrne, Kendra, Oslund, Kenneth, Kawaharazuka, Kento, Zhang, Kevin, Rana, Krishan, Srinivasan, Krishnan, Chen, Lawrence Yunliang, Pinto, Lerrel, Fei-Fei, Li, Tan, Liam, Ott, Lionel, Lee, Lisa, Tomizuka, Masayoshi, Spero, Max, Du, Maximilian, Ahn, Michael, Zhang, Mingtong, Ding, Mingyu, Srirama, Mohan Kumar, Sharma, Mohit, Kim, Moo Jin, Kanazawa, Naoaki, Hansen, Nicklas, Heess, Nicolas, Joshi, Nikhil J, Suenderhauf, Niko, Di Palo, Norman, Shafiullah, Nur Muhammad Mahi, Mees, Oier, Kroemer, Oliver, Sanketi, Pannag R, Wohlhart, Paul, Xu, Peng, Sermanet, Pierre, Sundaresan, Priya, Vuong, Quan, Rafailov, Rafael, Tian, Ran, Doshi, Ria, Martín-Martín, Roberto, Mendonca, Russell, Shah, Rutav, Hoque, Ryan, Julian, Ryan, Bustamante, Samuel, Kirmani, Sean, Levine, Sergey, Moore, Sherry, Bahl, Shikhar, Dass, Shivin, Sonawani, Shubham, Song, Shuran, Xu, Sichun, Haldar, Siddhant, Adebola, Simeon, Guist, Simon, Nasiriany, Soroush, Schaal, Stefan, Welker, Stefan, Tian, Stephen, Dasari, Sudeep, Belkhale, Suneel, Osa, Takayuki, Harada, Tatsuya, Matsushima, Tatsuya, Xiao, Ted, Yu, Tianhe, Ding, Tianli, Davchev, Todor, Zhao, Tony Z., Armstrong, Travis, Darrell, Trevor, Jain, Vidhi, Vanhoucke, Vincent, Zhan, Wei, Zhou, Wenxuan, Burgard, Wolfram, Chen, Xi, Wang, Xiaolong, Zhu, Xinghao, Li, Xuanlin, Lu, Yao, Chebotar, Yevgen, Zhou, Yifan, Zhu, Yifeng, Xu, Ying, Wang, Yixuan, Bisk, Yonatan, Cho, Yoonyoung, Lee, Youngwoon, Cui, Yuchen, Wu, Yueh-Hua, Tang, Yujin, Zhu, Yuke, Li, Yunzhu, Iwasawa, Yusuke, Matsuo, Yutaka, Xu, Zhuo, Cui, Zichen Jeff
Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train generalist X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. More details can be found on the project website $\href{https://robotics-transformer-x.github.io}{\text{robotics-transformer-x.github.io}}$.
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving
Sha, Hao, Mu, Yao, Jiang, Yuxuan, Chen, Li, Xu, Chenfeng, Luo, Ping, Li, Shengbo Eben, Tomizuka, Masayoshi, Zhan, Wei, Ding, Mingyu
Existing learning-based autonomous driving (AD) systems face challenges in comprehending high-level information, generalizing to rare events, and providing interpretability. To address these problems, this work employs Large Language Models (LLMs) as a decision-making component for complex AD scenarios that require human commonsense understanding. We devise cognitive pathways to enable comprehensive reasoning with LLMs, and develop algorithms for translating LLM decisions into actionable driving commands. Through this approach, LLM decisions are seamlessly integrated with low-level controllers by guided parameter matrix adaptation. Extensive experiments demonstrate that our proposed method not only consistently surpasses baseline approaches in single-vehicle tasks, but also helps handle complex driving behaviors even multi-vehicle coordination, thanks to the commonsense reasoning capabilities of LLMs. This paper presents an initial step toward leveraging LLMs as effective decision-makers for intricate AD scenarios in terms of safety, efficiency, generalizability, and interoperability. We aspire for it to serve as inspiration for future research in this field. Imagine you are behind the wheel, approaching an unsignalized intersection and planning to turn left, with an oncoming vehicle straight ahead. Human drivers intuitively know that according to traffic rules, they should slow down and yield, even if it is technically possible to speed through. However, existing advanced learning-based Autonomous Driving (AD) systems typically require complex rules or reward function designs to handle such scenarios effectively (Chen et al., 2023a; Kiran et al., 2022). This reliance on predefined rule bases often limits their ability to generalize to various situations. Another challenge facing existing learning-based AD systems is the long-tail problem (Buhet et al., 2019). Both limited datasets and sampling efficiency (Atakishiyev et al., 2023) can present challenges for existing learning-based AD systems when making decisions in rare real-world driving scenarios. Chauffeurnet (Bansal et al., 2018) demonstrated such limits where even 30 million stateaction samples were insufficient to learn an optimal policy that mapped bird's-eye view images (states) to control (action).
Human-oriented Representation Learning for Robotic Manipulation
Huo, Mingxiao, Ding, Mingyu, Xu, Chenfeng, Tian, Thomas, Zhu, Xinghao, Mu, Yao, Sun, Lingfeng, Tomizuka, Masayoshi, Zhan, Wei
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with the environments in manipulation tasks. We advocate that such a representation automatically arises from simultaneously learning about multiple simple perceptual skills that are critical for everyday scenarios (e.g., hand detection, state estimate, etc.) and is better suited for learning robot manipulation policies compared to current state-of-the-art visual representations purely based on self-supervised objectives. We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders, where each task is a perceptual skill tied to human-environment interactions. We introduce Task Fusion Decoder as a plug-and-play embedding translator that utilizes the underlying relationships among these perceptual skills to guide the representation learning towards encoding meaningful structure for what's important for all perceptual skills, ultimately empowering learning of downstream robotic manipulation tasks. Extensive experiments across a range of robotic tasks and embodiments, in both simulations and real-world environments, show that our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders including R3M, MVP, and EgoVLP, for downstream manipulation policy-learning. Project page: https://sites.google.com/view/human-oriented-robot-learning
RSRD: A Road Surface Reconstruction Dataset and Benchmark for Safe and Comfortable Autonomous Driving
Zhao, Tong, Xu, Chenfeng, Ding, Mingyu, Tomizuka, Masayoshi, Zhan, Wei, Wei, Yintao
This paper addresses the growing demands for safety and comfort in intelligent robot systems, particularly autonomous vehicles, where road conditions play a pivotal role in overall driving performance. For example, reconstructing road surfaces helps to enhance the analysis and prediction of vehicle responses for motion planning and control systems. We introduce the Road Surface Reconstruction Dataset (RSRD), a real-world, high-resolution, and high-precision dataset collected with a specialized platform in diverse driving conditions. It covers common road types containing approximately 16,000 pairs of stereo images, original point clouds, and ground-truth depth/disparity maps, with accurate post-processing pipelines to ensure its quality. Based on RSRD, we further build a comprehensive benchmark for recovering road profiles through depth estimation and stereo matching. Preliminary evaluations with various state-of-the-art methods reveal the effectiveness of our dataset and the challenge of the task, underscoring substantial opportunities of RSRD as a valuable resource for advancing techniques, e.g., multi-view stereo towards safe autonomous driving. The dataset and demo videos are available at https://thu-rsxd.com/rsrd/
Image2Point: 3D Point-Cloud Understanding with Pretrained 2D ConvNets
Xu, Chenfeng, Yang, Shijia, Zhai, Bohan, Wu, Bichen, Yue, Xiangyu, Zhan, Wei, Vajda, Peter, Keutzer, Kurt, Tomizuka, Masayoshi
3D point-clouds and 2D images are different visual representations of the physical world. While human vision can understand both representations, computer vision models designed for 2D image and 3D point-cloud understanding are quite different. Our paper investigates the potential for transferability between these two representations by empirically investigating whether this approach works, what factors affect the transfer performance, and how to make it work even better. We discovered that we can indeed use the same neural net model architectures to understand both images and point-clouds. Moreover, we can transfer pretrained weights from image models to point-cloud models with minimal effort. Specifically, based on a 2D ConvNet pretrained on an image dataset, we can transfer the image model to a point-cloud model by \textit{inflating} 2D convolutional filters to 3D then finetuning its input, output, and optionally normalization layers. The transferred model can achieve competitive performance on 3D point-cloud classification, indoor and driving scene segmentation, even beating a wide range of point-cloud models that adopt task-specific architectures and use a variety of tricks.