
Collaborating Authors: Xiong, Yuwen


Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

arXiv.org Artificial Intelligence

While recent vision-language-action models trained on diverse robot datasets exhibit promising generalization capabilities with limited in-domain data, their reliance on compact action heads to predict discretized or continuous actions constrains adaptability to heterogeneous action spaces. We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences through a unified multimodal diffusion process. Departing from prior methods that condition denoising on fused embeddings via shallow networks, Dita employs in-context conditioning, enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. This design explicitly models action deltas and environmental nuances. By scaling the diffusion action denoiser alongside the Transformer's scalability, Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces. This synergy enhances robustness to diverse variations and facilitates the successful execution of long-horizon tasks. Evaluations across extensive benchmarks demonstrate state-of-the-art or comparable performance in simulation. Notably, Dita achieves robust real-world adaptation to environmental variances and complex long-horizon tasks through 10-shot finetuning, using only third-person camera inputs. The architecture establishes a versatile, lightweight, and open-source baseline for generalist robot policy learning. Project Page: https://robodita.github.io.
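The in-context conditioning described above can be illustrated with a minimal sketch: noised action tokens are concatenated with raw visual observation tokens and denoised jointly by a Transformer, instead of conditioning a small head on a pooled embedding. The module names and sizes below are hypothetical assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class InContextActionDenoiser(nn.Module):
    """Hypothetical sketch: denoise an action chunk by attending over raw visual tokens."""
    def __init__(self, act_dim=7, chunk=16, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.act_in = nn.Linear(act_dim, d_model)       # embed noised actions
        self.t_embed = nn.Embedding(1000, d_model)      # diffusion timestep embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.act_out = nn.Linear(d_model, act_dim)      # predict the noise for each action step
        self.chunk = chunk

    def forward(self, noisy_actions, vis_tokens, t):
        # noisy_actions: (B, chunk, act_dim); vis_tokens: (B, N, d_model); t: (B,)
        a = self.act_in(noisy_actions) + self.t_embed(t)[:, None, :]
        x = torch.cat([vis_tokens, a], dim=1)           # in-context conditioning: one joint sequence
        h = self.backbone(x)
        return self.act_out(h[:, -self.chunk:])         # read out only the action positions

model = InContextActionDenoiser()
eps_hat = model(torch.randn(2, 16, 7), torch.randn(2, 64, 256), torch.randint(0, 1000, (2,)))
print(eps_hat.shape)  # torch.Size([2, 16, 7])
```

The design choice the sketch highlights is that the visual tokens stay unpooled, so every denoising step can attend to fine-grained observation detail.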


Diffusion Transformer Policy

arXiv.org Artificial Intelligence

Recent large vision-language-action models pretrained on diverse robot datasets have demonstrated the potential for generalizing to new environments with limited in-domain data. However, those approaches usually predict discretized or continuous actions with a small action head, which limits their ability to handle diverse action spaces. In contrast, we model continuous actions with a large multi-modal diffusion transformer, dubbed Diffusion Transformer Policy, in which we directly denoise action chunks with a large transformer model rather than a small action head. By leveraging the scaling capability of transformers, the proposed approach can effectively model continuous end-effector actions across large, diverse robot datasets and achieve better generalization performance. Extensive experiments demonstrate that Diffusion Transformer Policy pretrained on diverse robot data can generalize to different embodiments, including simulation environments such as ManiSkill2 and CALVIN, as well as a real-world Franka arm. Specifically, without bells and whistles, the proposed approach achieves state-of-the-art performance with only a single third-view camera stream in the CALVIN novel-task setting (ABC→D), improving the average number of tasks completed in a row (out of 5) to 3.6, and the pretraining stage increases the success sequence length on CALVIN by over 1.2. The code will be publicly available. The traditional robot learning paradigm usually relies on large-scale data collected for a specific robot and task, but collecting robot data for generalist tasks is time-consuming and expensive due to the limitations of robot hardware in the real world. Nowadays, foundation models OpenAI (2022; 2023; 2021); Rombach et al. (2021) in Natural Language Processing and Computer Vision, pretrained on broad, diverse, task-agnostic datasets, have demonstrated a powerful ability to solve downstream tasks either zero-shot or with a few task-specific samples. It is in principle possible that a general robot policy exposed to large-scale, diverse robot datasets improves generalization and performance on downstream tasks Brohan et al. (2022; 2023).
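For context on the CALVIN number quoted above: the benchmark rolls out chains of five language instructions, and the reported score is the average length of the prefix of consecutive successes. A small, self-contained illustration of how that metric is computed (the rollout outcomes below are made up):

```python
def avg_sequence_length(rollouts):
    """Each rollout is a list of 5 booleans: whether task i in the instruction chain succeeded.
    The score counts consecutive successes from the start of the chain."""
    lengths = []
    for chain in rollouts:
        n = 0
        for ok in chain:
            if not ok:
                break
            n += 1
        lengths.append(n)
    return sum(lengths) / len(lengths)

# Made-up example: three evaluation chains of 5 tasks each.
rollouts = [
    [True, True, True, True, False],   # 4 in a row
    [True, True, True, True, True],    # 5 in a row
    [True, False, True, True, True],   # 1 in a row (chain breaks at task 2)
]
print(avg_sequence_length(rollouts))   # 3.33...
```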


MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer

arXiv.org Artificial Intelligence

Developing generative models for interleaved image-text data has both research and practical value. It requires models to understand the interleaved sequences and subsequently generate images and text. However, existing attempts are limited by the issue that a fixed number of visual tokens cannot efficiently capture image details, which is particularly problematic in multi-image scenarios. To address this, we present MM-Interleaved, an end-to-end generative model for interleaved image-text data. It introduces a multi-scale and multi-image feature synchronizer module, allowing direct access to fine-grained image features from the previous context during the generation process. MM-Interleaved is end-to-end pre-trained on both paired and interleaved image-text corpora. It is further enhanced through a supervised fine-tuning phase, wherein the model improves its ability to follow complex multi-modal instructions. Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details following multi-modal instructions and generating consistent images following both textual and visual conditions. Code and models are available at https://github.com/OpenGVLab/MM-Interleaved.
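A hedged sketch of the feature-synchronizer idea: rather than compressing each image into a fixed number of tokens, the tokens being generated cross-attend directly into fine-grained, multi-scale feature maps of the preceding images. Names and dimensions here are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class FeatureSynchronizer(nn.Module):
    """Hypothetical sketch: generation-time queries attend over fine-grained image features."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, queries, image_feature_maps):
        # queries: (B, Q, d); image_feature_maps: list of (B, H_i*W_i, d), one per scale/image
        kv = torch.cat(image_feature_maps, dim=1)   # pool all scales and images as keys/values
        out, _ = self.attn(queries, kv, kv)
        return self.norm(queries + out)              # residual update of the queries

sync = FeatureSynchronizer()
q = torch.randn(1, 32, 512)                          # e.g. hidden states of tokens being generated
feats = [torch.randn(1, 196, 512), torch.randn(1, 49, 512)]  # two scales of one previous image
print(sync(q, feats).shape)                          # torch.Size([1, 32, 512])
```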


Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion

arXiv.org Artificial Intelligence

Learning world models can teach an agent how the world works in an unsupervised manner. Even though it can be viewed as a special case of sequence modeling, progress in scaling world models for robotic applications such as autonomous driving has been somewhat less rapid than scaling language models with Generative Pre-trained Transformers (GPT). We identify two major bottlenecks: dealing with a complex and unstructured observation space, and having a scalable generative model. Consequently, we propose a novel world modeling approach that first tokenizes sensor observations with VQVAE, then predicts the future via discrete diffusion. When applied to learning world models on point cloud observations, our model reduces the prior SOTA Chamfer distance by more than 65% for 1s prediction, and more than 50% for 3s prediction, across the NuScenes, KITTI Odometry, and Argoverse2 datasets. Our results demonstrate that discrete diffusion on tokenized agent experience can unlock the power of GPT-like unsupervised learning for robotic agents. [Figure 1: Our unsupervised world model can produce accurate near-term 1s predictions and diverse multi-future 3s predictions directly on the level of point cloud observations.] World models explicitly represent the knowledge of an autonomous agent about its environment. They are defined as a generative model that predicts the next observation in an environment given past observations and the current action. Such a generative model can learn from any unlabeled agent experience, and can be used for both learning and planning in the model-based reinforcement learning framework (Sutton, 1991). This approach has excelled in domains such as Atari (Kaiser et al., 2019), robotic manipulation (Nagabandi et al., 2020), and Minecraft (Hafner et al., 2023). Learning world models can be viewed as a special case of sequence modeling on agent experience. While Generative Pre-trained Transformers (GPT) (Brown et al., 2020) have enabled rapid progress in such sequence modeling for language, prediction systems in autonomous driving still require supervised learning, either on the level of bounding boxes (Luo et al., 2018), semantic segmentation (Sadat et al., 2020), or instance segmentation (Hu et al., 2021). However, just as GPT learns to understand language via next-token prediction, if a world model can predict unlabeled future observations really well, it must have developed a general understanding of the scene, including geometry and dynamics.
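A minimal, purely illustrative sketch of the second stage: once a VQVAE has turned an observation into a grid of discrete token IDs, a future frame can be sampled by iterative unmasking, an absorbing-state style of discrete diffusion. The model below is a toy stand-in; all names, sizes, and the sampling schedule are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

VOCAB, MASK_ID, SEQ = 1024, 1024, 256   # codebook size, extra [MASK] id, tokens per frame

class ToyTokenPredictor(nn.Module):
    """Stand-in for the discrete diffusion denoiser over VQ token grids."""
    def __init__(self, d=256):
        super().__init__()
        self.emb = nn.Embedding(VOCAB + 1, d)                     # +1 for the mask token
        layer = nn.TransformerEncoderLayer(d, 8, batch_first=True)
        self.body = nn.TransformerEncoder(layer, 2)
        self.head = nn.Linear(d, VOCAB)

    def forward(self, tokens):                                    # tokens: (B, SEQ) int64
        return self.head(self.body(self.emb(tokens)))             # (B, SEQ, VOCAB)

@torch.no_grad()
def sample_future_frame(model, steps=8):
    """Iterative unmasking: start fully masked, commit the most confident tokens each step."""
    tokens = torch.full((1, SEQ), MASK_ID, dtype=torch.long)
    for s in range(steps):
        logits = model(tokens)
        probs, preds = logits.softmax(-1).max(-1)                 # confidence and argmax per position
        still_masked = tokens == MASK_ID
        k = max(1, int(still_masked.sum() * (1.0 / (steps - s)))) # how many positions to commit now
        conf = torch.where(still_masked, probs, torch.full_like(probs, -1.0))
        idx = conf.topk(k, dim=-1).indices
        tokens[0, idx[0]] = preds[0, idx[0]]
    return tokens

print(sample_future_frame(ToyTokenPredictor()).shape)             # torch.Size([1, 256])
```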


Towards Unsupervised Object Detection From LiDAR Point Clouds

arXiv.org Artificial Intelligence

In this paper, we study the problem of unsupervised object detection from 3D point clouds in self-driving scenes. We present a simple yet effective method that exploits (i) point clustering in near-range areas where the point clouds are dense, (ii) temporal consistency to filter out noisy unsupervised detections, (iii) translation equivariance of CNNs to extend the auto-labels to long range, and (iv) self-supervision for improving on its own. Our approach, OYSTER (Object Discovery via Spatio-Temporal Refinement), does not impose constraints on data collection (such as repeated traversals of the same location), is able to detect objects in a zero-shot manner without supervised finetuning (even in sparse, distant regions), and continues to self-improve given more rounds of iterative self-training. To better measure model performance in self-driving scenarios, we propose a new planning-centric perception metric based on distance-to-collision. We demonstrate that our unsupervised object detector significantly outperforms unsupervised baselines on the PandaSet and Argoverse 2 Sensor datasets, showing promise that self-supervision combined with object priors can enable object discovery in the wild. For more information, visit the project website: https://waabi.ai/research/oyster
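A rough sketch of ingredients (i) and (ii), near-range clustering and temporal-consistency filtering, using DBSCAN on bird's-eye-view points as a stand-in for the paper's clustering step. The thresholds, ranges, and toy data are made up, and the matching is a simplification of real multi-frame tracking.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def near_range_clusters(points_xy, max_range=40.0, eps=0.7, min_samples=5):
    """(i) Cluster only near-range BEV points, where the LiDAR sweep is dense enough."""
    near = points_xy[np.linalg.norm(points_xy, axis=1) < max_range]
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(near)
    centroids = [near[labels == k].mean(axis=0) for k in set(labels) if k != -1]
    return np.array(centroids)

def temporally_consistent(detections_per_frame, match_dist=1.0, min_frames=3):
    """(ii) Keep only detections that can be re-matched across several consecutive frames."""
    kept = []
    for det in detections_per_frame[0]:
        count = 1
        for frame in detections_per_frame[1:]:
            if len(frame) and np.min(np.linalg.norm(frame - det, axis=1)) < match_dist:
                count += 1
        if count >= min_frames:
            kept.append(det)
    return np.array(kept)

# Toy example: two point clusters in frame 0, re-observed (with noise) in later frames.
rng = np.random.default_rng(0)
frame0 = np.concatenate([rng.normal([5, 5], 0.2, (50, 2)), rng.normal([12, -3], 0.2, (50, 2))])
dets = [near_range_clusters(frame0)]
dets += [dets[0] + rng.normal(0, 0.1, dets[0].shape) for _ in range(3)]  # fake later frames
print(temporally_consistent(dets))
```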


LabelFormer: Object Trajectory Refinement for Offboard Perception from LiDAR Point Clouds

arXiv.org Artificial Intelligence

A major bottleneck to scaling up the training of self-driving perception systems is the human annotation required for supervision. A promising alternative is to leverage "auto-labelling" offboard perception models that are trained to automatically generate annotations from raw LiDAR point clouds at a fraction of the cost. Auto-labels are most commonly generated via a two-stage approach: first, objects are detected and tracked over time, and then each object trajectory is passed to a learned refinement model to improve accuracy. Since existing refinement models are overly complex and lack advanced temporal reasoning capabilities, in this work we propose LabelFormer, a simple, efficient, and effective trajectory-level refinement approach. Our approach first encodes each frame's observations separately, then exploits self-attention to reason about the trajectory with full temporal context, and finally decodes the refined object size and per-frame poses. Evaluation on both urban and highway datasets demonstrates that LabelFormer outperforms existing works by a large margin. Finally, we show that training on a dataset augmented with auto-labels generated by our method leads to improved downstream detection performance compared to existing methods. Please visit the project website for details: https://waabi.ai/labelformer
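The encode / self-attend / decode recipe maps naturally onto a small Transformer over the trajectory's frames; the sketch below is a hypothetical illustration with made-up feature sizes and heads, not the LabelFormer architecture itself.

```python
import torch
import torch.nn as nn

class TrajectoryRefiner(nn.Module):
    """Hypothetical trajectory-level refinement sketch: refine a coarse object track."""
    def __init__(self, frame_feat=64, d=128, n_layers=3):
        super().__init__()
        self.frame_encoder = nn.Sequential(nn.Linear(frame_feat, d), nn.ReLU(), nn.Linear(d, d))
        layer = nn.TransformerEncoderLayer(d, 4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, n_layers)    # self-attention over the full trajectory
        self.pose_head = nn.Linear(d, 4)                          # per-frame (x, y, z, yaw) residual
        self.size_head = nn.Linear(d, 3)                          # a single (l, w, h) for the object

    def forward(self, frame_feats, coarse_poses):
        # frame_feats: (B, T, frame_feat) per-frame observation features inside the coarse box
        # coarse_poses: (B, T, 4) initial track poses from a detector + tracker
        h = self.temporal(self.frame_encoder(frame_feats))
        refined_poses = coarse_poses + self.pose_head(h)          # residual pose refinement
        size = self.size_head(h.mean(dim=1))                      # one size per trajectory
        return refined_poses, size

model = TrajectoryRefiner()
poses, size = model(torch.randn(2, 20, 64), torch.randn(2, 20, 4))
print(poses.shape, size.shape)   # torch.Size([2, 20, 4]) torch.Size([2, 3])
```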


UltraLiDAR: Learning Compact Representations for LiDAR Completion and Generation

arXiv.org Artificial Intelligence

LiDAR provides accurate geometric measurements of the 3D world. Unfortunately, dense LiDARs are very expensive, and the point clouds captured by low-beam LiDAR are often sparse. To address these issues, we present UltraLiDAR, a data-driven framework for scene-level LiDAR completion, LiDAR generation, and LiDAR manipulation. The crux of UltraLiDAR is a compact, discrete representation that encodes the point cloud's geometric structure, is robust to noise, and is easy to manipulate. We show that by aligning the representation of a sparse point cloud to that of a dense point cloud, we can densify the sparse point clouds as if they were captured by a real high-density LiDAR, drastically reducing the cost. Furthermore, by learning a prior over the discrete codebook, we can generate diverse, realistic LiDAR point clouds for self-driving. We evaluate the effectiveness of UltraLiDAR on sparse-to-dense LiDAR completion and LiDAR generation. Experiments show that densifying real-world point clouds with our approach can significantly improve the performance of downstream perception systems. Compared to prior art on LiDAR generation, our approach generates much more realistic point clouds. In an A/B test, human participants preferred our results over those of previous methods more than 98.5% of the time.
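At the core is a discrete code for the scene: features are snapped to their nearest codebook entry, and completion and generation then operate on those codes. The toy vector-quantization step below only illustrates that representation; the codebook, features, and agreement check are fabricated for the example and are not the UltraLiDAR pipeline.

```python
import torch

def vector_quantize(features, codebook):
    """Assign each feature vector to its nearest codebook entry (training tricks omitted).
    features: (N, d); codebook: (K, d). Returns discrete code indices and quantized vectors."""
    dists = torch.cdist(features, codebook)          # (N, K) pairwise distances
    codes = dists.argmin(dim=1)                      # discrete representation of the scene
    return codes, codebook[codes]

# Toy example: a "dense" scan's features and a subsampled, noisier "sparse" view of them.
torch.manual_seed(0)
codebook = torch.randn(512, 64)                      # made-up learned codebook
dense_feats = torch.randn(1000, 64)
sparse_feats = dense_feats[::4] + 0.05 * torch.randn(250, 64)

dense_codes, _ = vector_quantize(dense_feats, codebook)
sparse_codes, _ = vector_quantize(sparse_feats, codebook)

# Densification, in this view, means predicting the dense code grid from the sparse one;
# here we only check how often the sparse view already lands on the same code.
agreement = (sparse_codes == dense_codes[::4]).float().mean()
print(f"code agreement between sparse and dense views: {agreement:.2f}")
```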


Adv3D: Generating Safety-Critical 3D Objects through Closed-Loop Simulation

arXiv.org Artificial Intelligence

Self-driving vehicles (SDVs) must be rigorously tested on a wide range of scenarios to ensure safe deployment. The industry typically relies on closed-loop simulation to evaluate how the SDV interacts with a corpus of synthetic and real scenarios and to verify that it performs properly. However, such tests primarily exercise the system's motion planning module and consider only behavior variations. It is key to evaluate the full autonomy system in closed loop, and to understand how variations in sensor data based on scene appearance, such as the shape of actors, affect system performance. In this paper, we propose a framework, Adv3D, that takes real-world scenarios, performs closed-loop sensor simulation to evaluate autonomy performance, and finds vehicle shapes that make the scenario more challenging, resulting in autonomy failures and uncomfortable SDV maneuvers. Unlike prior works that add contrived adversarial shapes to vehicle rooftops or the roadside to harm perception only, we optimize a low-dimensional shape representation to modify the vehicle shape itself in a realistic manner that degrades autonomy performance (e.g., perception, prediction, and motion planning). Moreover, we find that the shape variations found with Adv3D optimized in closed loop are much more effective than those found in open loop, demonstrating the importance of finding scene appearance variations that affect autonomy in the interactive setting.
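Conceptually, the search is black-box optimization of a low-dimensional shape latent against a closed-loop autonomy score. Below is a skeleton of such a loop with simple random search as a stand-in optimizer; the simulator call and scoring function are placeholders, not Adv3D's actual objective or optimizer.

```python
import numpy as np

def run_closed_loop(shape_latent):
    """Placeholder for rerunning the full autonomy stack in a sensor simulator with the
    actor's shape modified by `shape_latent`; returns a scalar autonomy score (higher = safer).
    Here it is just a made-up smooth function so the example runs."""
    target = np.array([0.8, -0.3, 0.5, 0.1])
    return float(-np.linalg.norm(shape_latent - target))

def find_adversarial_shape(dim=4, iters=200, sigma=0.2, seed=0):
    """Random-search stand-in: perturb the shape latent, keep changes that lower the score."""
    rng = np.random.default_rng(seed)
    latent = np.zeros(dim)
    score = run_closed_loop(latent)
    for _ in range(iters):
        candidate = latent + sigma * rng.standard_normal(dim)
        candidate_score = run_closed_loop(candidate)
        if candidate_score < score:          # adversarial objective: degrade autonomy performance
            latent, score = candidate, candidate_score
    return latent, score

latent, score = find_adversarial_shape()
print(latent.round(2), round(score, 3))
```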


Rethinking Closed-loop Training for Autonomous Driving

arXiv.org Artificial Intelligence

Recent advances in high-fidelity simulators have enabled closed-loop training of autonomous driving agents, potentially addressing the distribution shift between training and deployment and allowing training to be scaled both safely and cheaply. However, there is a lack of understanding of how to build effective training benchmarks for closed-loop training. In this work, we present the first empirical study that analyzes the effects of different training benchmark designs on the success of learning agents, such as how to design traffic scenarios and how to scale training environments. Furthermore, we show that many popular RL algorithms cannot achieve satisfactory performance in the context of autonomous driving, as they lack long-term planning and take an extremely long time to train. To address these issues, we propose trajectory value learning (TRAVL), an RL-based driving agent that performs planning with multistep look-ahead and exploits cheaply generated imagined data for efficient learning. Our experiments show that TRAVL can learn much faster and produce safer maneuvers compared to all the baselines. For more information, visit the project website: https://waabi.ai/research/travl
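The planning step described above, picking the candidate trajectory whose multistep look-ahead outcome has the highest learned value, can be sketched as follows; the dynamics model, value function, and candidate trajectories are all toy placeholders, not TRAVL's learned components.

```python
import numpy as np

def imagined_rollout(state, trajectory, dynamics):
    """Roll a candidate trajectory forward with a (learned or simple) dynamics model."""
    for action in trajectory:
        state = dynamics(state, action)
    return state

def plan_with_lookahead(state, candidate_trajectories, dynamics, value_fn):
    """Score each candidate by the value of the state reached after the full look-ahead."""
    scores = [value_fn(imagined_rollout(state, traj, dynamics)) for traj in candidate_trajectories]
    return candidate_trajectories[int(np.argmax(scores))]

# Toy 2D example: state = (x, y), action = (dx, dy); value prefers staying near the lane center y = 0.
dynamics = lambda s, a: s + a
value_fn = lambda s: -abs(s[1])
candidates = [np.tile(a, (5, 1)) for a in ([1.0, 0.5], [1.0, 0.0], [1.0, -0.4])]  # 5-step plans
best = plan_with_lookahead(np.zeros(2), candidates, dynamics, value_fn)
print(best[0])   # [1. 0.] -- the trajectory that keeps y closest to 0
```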


Cost-Efficient Online Hyperparameter Optimization

arXiv.org Machine Learning

Recent work on hyperparameter optimization (HPO) has shown the possibility of training certain hyperparameters together with regular parameters. However, these online HPO algorithms still require running an evaluation on a set of validation examples at each training step, steeply increasing the training cost. To decide when to query the validation loss, we model online HPO as a time-varying Bayesian optimization problem, on top of which we propose a novel "costly feedback" setting to capture the concept of query cost. Under this setting, standard algorithms are cost-inefficient as they evaluate on the validation set at every round. In contrast, the cost-efficient GP-UCB algorithm proposed in this paper queries the unknown function only when the model is less confident about current decisions. We evaluate our proposed algorithm by tuning hyperparameters online for VGG and ResNet on CIFAR-10 and ImageNet100. Our proposed online HPO algorithm reaches human-expert-level performance within a single run of the experiment, while incurring only modest computational overhead compared to regular training.
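The mechanism at the heart of the method, paying the validation-query cost only when the surrogate is uncertain, can be sketched with a GP surrogate: pick the next hyperparameter with a UCB-style acquisition, but evaluate the validation loss only if the predictive standard deviation there exceeds a threshold. This sketch uses scikit-learn's GaussianProcessRegressor as a generic surrogate; the toy objective, threshold, and schedule are made up and are not the paper's algorithm verbatim.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def validation_loss(lr):
    """Placeholder for the expensive validation-set evaluation."""
    return (np.log10(lr) + 2.5) ** 2 + 0.01 * np.random.randn()

grid = np.logspace(-5, 0, 100).reshape(-1, 1)          # candidate learning rates
X, y = [[1e-3]], [validation_loss(1e-3)]                # one initial (costly) observation
gp = GaussianProcessRegressor()
queries = 1

for step in range(30):
    gp.fit(np.log10(X), y)
    mu, std = gp.predict(np.log10(grid), return_std=True)
    acq = -(mu - 1.0 * std)                             # lower confidence bound (we minimise the loss)
    x_next = grid[np.argmax(acq)][0]
    _, std_next = gp.predict(np.log10([[x_next]]), return_std=True)
    if std_next[0] > 0.05:                              # query only when the surrogate is unsure
        X.append([x_next]); y.append(validation_loss(x_next)); queries += 1

best_lr = X[int(np.argmin(y))][0]
print(f"best lr ~ {best_lr:.1e} after {queries} costly validation queries")
```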