Zhang, Shiduo
World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning
Wang, Siyin, Fei, Zhaoye, Cheng, Qinyuan, Zhang, Shiduo, Cai, Panpan, Fu, Jinlan, Qiu, Xipeng
Recent advances in large vision-language models (LVLMs) have shown promise for embodied task planning, yet they struggle with fundamental challenges like dependency constraints and efficiency. Existing approaches either solely optimize action selection or leverage world models during inference, overlooking the benefits of learning to model the world as a way to enhance planning capabilities. We propose Dual Preference Optimization (D$^2$PO), a new learning framework that jointly optimizes state prediction and action selection through preference learning, enabling LVLMs to understand environment dynamics for better planning. To automatically collect trajectories and stepwise preference data without human annotation, we introduce a tree search mechanism for extensive exploration via trial-and-error. Extensive experiments on VoTa-Bench demonstrate that our D$^2$PO-based method significantly outperforms existing methods and GPT-4o when applied to Qwen2-VL (7B), LLaVA-1.6 (7B), and LLaMA-3.2 (11B), achieving superior task success rates with more efficient execution paths.
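The abstract above summarizes D$^2$PO as jointly optimizing state prediction and action selection through preference learning. Below is a minimal, hypothetical sketch of how such a dual objective could be assembled from two DPO-style terms; the pairing of an action-preference term with a state-prediction-preference term, the weighting factor `lam`, and all function names are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def dpo_term(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Standard DPO objective on one preference pair:
    # -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin)

def dual_preference_loss(action_pair, state_pair, beta=0.1, lam=1.0):
    # Hypothetical combined objective: one preference term over preferred vs.
    # dispreferred actions, one over preferred vs. dispreferred predicted next
    # states. Each pair is (policy logp chosen, policy logp rejected,
    # reference logp chosen, reference logp rejected) tensors.
    action_loss = dpo_term(*action_pair, beta=beta)
    state_loss = dpo_term(*state_pair, beta=beta)
    return (action_loss + lam * state_loss).mean()

# Usage with dummy log-probabilities for a batch of 4 stepwise preference pairs.
pairs = lambda: tuple(torch.randn(4) for _ in range(4))
loss = dual_preference_loss(pairs(), pairs())
```

In this sketch the preference pairs themselves would come from the tree-search exploration described in the abstract, with more successful branches treated as chosen and their weaker siblings as rejected.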
VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks
Zhang, Shiduo, Xu, Zhe, Liu, Peiju, Yu, Xiaopeng, Li, Yuan, Gao, Qinghui, Fei, Zhaoye, Yin, Zhangyue, Wu, Zuxuan, Jiang, Yu-Gang, Qiu, Xipeng
General-purpose embodied agents are designed to understand users' natural instructions or intentions and to act precisely to complete universal tasks. Recently, methods based on foundation models, especially Vision-Language-Action models (VLAs), have shown substantial potential for solving language-conditioned manipulation (LCM) tasks. However, existing benchmarks do not adequately meet the needs of VLAs and related algorithms. To better define such general-purpose tasks in the context of LLMs and to advance research on VLAs, we present VLABench, an open-source benchmark for evaluating universal LCM task learning. VLABench provides 100 carefully designed task categories, with strong randomization within each category and a total of 2000+ objects. VLABench stands out from previous benchmarks in four key aspects: 1) tasks requiring world knowledge and common-sense transfer, 2) natural language instructions with implicit human intentions rather than templates, 3) long-horizon tasks demanding multi-step reasoning, and 4) evaluation of both action policies and language model capabilities. The benchmark assesses multiple competencies, including understanding of mesh and texture, spatial relationships, semantic instructions, physical laws, knowledge transfer, and reasoning. To support downstream fine-tuning, we provide high-quality training data collected via an automated framework incorporating heuristic skills and prior information. The experimental results indicate that both current state-of-the-art pretrained VLAs and workflows based on VLMs face challenges in our tasks.
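As a rough illustration of the kind of episode the benchmark describes, the hypothetical data structure below sketches one randomized, language-conditioned task instance: a task category, objects sampled for that category, and a natural-language instruction carrying an implicit goal. All names and fields here are assumptions for illustration and are not the VLABench API.

```python
import random
from dataclasses import dataclass, field

@dataclass
class LCMTaskInstance:
    # One language-conditioned manipulation episode (illustrative structure only).
    category: str                                        # one of the ~100 task categories
    scene_objects: list = field(default_factory=list)    # sampled from a large object pool
    instruction: str = ""                                 # natural language with an implicit intention
    horizon: int = 1                                       # >1 for long-horizon, multi-step tasks

def sample_instance(category, asset_pool, instruction_pool, horizon=3):
    # Randomize one episode within a category, mirroring the per-category
    # randomization described in the abstract (names are hypothetical).
    objects = random.sample(asset_pool, k=min(5, len(asset_pool)))
    instruction = random.choice(instruction_pool)
    return LCMTaskInstance(category, objects, instruction, horizon)
```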
Large Trajectory Models are Scalable Motion Predictors and Planners
Sun, Qiao, Zhang, Shiduo, Ma, Danjiao, Shi, Jingzhe, Li, Derun, Luo, Simian, Wang, Yu, Xu, Ningyi, Cao, Guangzhi, Zhao, Hang
Motion prediction and planning are vital tasks in autonomous driving, and recent efforts have shifted toward machine learning-based approaches. The challenges include understanding diverse road topologies, reasoning about traffic dynamics over long time horizons, interpreting heterogeneous behaviors, and generating policies in a large continuous state space. Inspired by the success of large language models in addressing similar complexities through model scaling, we introduce a scalable trajectory model called State Transformer (STR). Our approach unifies trajectory generation with other sequence modeling problems, enabling rapid iteration on breakthroughs from neighboring domains such as language modeling. Remarkably, experimental results reveal that large trajectory models (LTMs), such as STR, adhere to scaling laws, showing outstanding adaptability and learning efficiency. Qualitative results further demonstrate that LTMs can make plausible predictions in scenarios that diverge significantly from the training data distribution. LTMs also learn to perform complex reasoning for long-term planning without explicit loss designs or costly high-level annotations.

Motion planning and prediction in autonomous driving rely on the ability to semantically understand complex driving environments and the interactions among various road users. Learning-based methods are pivotal to overcoming this complexity, as rule-based and scenario-specific strategies often prove inadequate to cover all possible situations and unexpected events that may occur during operation. Such learning problems can be regarded as conditional sequence-to-sequence tasks, where models leverage past trajectories to generate future ones, conditioned on the observations. Notably, these problems share structural similarities with other sequence modeling problems, such as language generation. Recent studies (Mirchandani et al., 2023; Zeng et al., 2023) have demonstrated that LLMs excel not only at natural language generation but also at tackling a wide range of sequence modeling and time series forecasting challenges. Building on these insights, prior research (Chen et al., 2021; Janner et al., 2021; Sun et al., 2023) has effectively used conditional causal transformers to address motion planning as a large sequence modeling problem, with both behavior cloning and reinforcement learning. Furthermore, Brohan et al. (2023) replace the transformer backbone with language models, demonstrating the potential to merge motion planning with other modalities within one large sequence for LLMs.
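The paragraph above casts motion prediction and planning as conditional sequence modeling with a causal transformer. The sketch below illustrates that framing under simple assumptions: scene context tokens are prepended to embedded past trajectory states, and a causally masked transformer predicts the next state at each trajectory position. The dimensions, the flat 2D state token, and the class name are illustrative and do not reproduce the STR architecture.

```python
import torch
import torch.nn as nn

class TinyTrajectoryTransformer(nn.Module):
    # Minimal conditional causal transformer for trajectory sequence modeling.
    def __init__(self, d_model=128, n_layers=4, n_heads=4, state_dim=2):
        super().__init__()
        self.state_in = nn.Linear(state_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.state_out = nn.Linear(d_model, state_dim)

    def forward(self, context, past_states):
        # context: (B, Tc, d_model) encoded map/agent features
        # past_states: (B, Tp, state_dim) observed trajectory, e.g. (x, y) per step
        x = torch.cat([context, self.state_in(past_states)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.backbone(x, mask=mask)
        # Predict the next state at each trajectory position (teacher forcing at train time).
        return self.state_out(h[:, context.size(1):])

# Example: 8 context tokens and 10 past states yield 10 next-state predictions per sample.
model = TinyTrajectoryTransformer()
out = model(torch.randn(2, 8, 128), torch.randn(2, 10, 2))
```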