Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation
Wei, Meng, Wan, Chenyang, Peng, Jiaqi, Yu, Xiqian, Yang, Yuqiang, Feng, Delin, Cai, Wenzhe, Zhu, Chenming, Wang, Tai, Pang, Jiangmiao, Liu, Xihui
While recent large vision-language models (VLMs) have improved generalization in vision-language navigation (VLN), existing methods typically rely on end-to-end pipelines that map vision-language inputs directly to short-horizon discrete actions. Such designs often produce fragmented motions, incur high latency, and struggle with real-world challenges like dynamic obstacle avoidance. We propose DualVLN, the first dual-system VLN foundation model that synergistically integrates high-level reasoning with low-level action execution. System 2, a VLM-based global planner, "grounds slowly" by predicting mid-term waypoint goals via image-grounded reasoning. System 1, a lightweight Diffusion Transformer policy with multi-modal conditioning, "moves fast" by leveraging both explicit pixel goals and latent features from System 2 to generate smooth and accurate trajectories. The dual-system design enables robust real-time control and adaptive local decision-making in complex, dynamic environments. By decoupling the training of the two systems, the VLM retains its generalization, while System 1 achieves interpretable and effective local navigation. DualVLN outperforms prior methods across all VLN benchmarks, and real-world experiments demonstrate robust long-horizon planning and real-time adaptability in dynamic environments.
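As a rough illustration of the dual-system control loop the DualVLN abstract describes, the sketch below runs a slow planner every few steps and a fast goal-conditioned policy every step. All class names, feature sizes, and the replanning interval are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of a dual-system loop: a slow planner ("System 2")
# emits a pixel goal + latent; a fast policy ("System 1") consumes both.
import torch
import torch.nn as nn

class SlowPlanner(nn.Module):
    """Stand-in for System 2: maps observation features (+ instruction,
    omitted here) to a mid-term pixel goal and a latent conditioning vector."""
    def __init__(self, obs_dim=512, latent_dim=256):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, latent_dim)
        self.goal_head = nn.Linear(latent_dim, 2)  # (u, v) pixel goal

    def forward(self, obs_feat):
        z = torch.tanh(self.encoder(obs_feat))
        return self.goal_head(z), z  # pixel goal + latent features

class FastPolicy(nn.Module):
    """Stand-in for System 1: conditions on the pixel goal and the
    planner latent, outputs a short chunk of 8 (x, y) waypoints."""
    def __init__(self, obs_dim=512, latent_dim=256, horizon=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + latent_dim + 2, 256), nn.ReLU(),
            nn.Linear(256, horizon * 2))
        self.horizon = horizon

    def forward(self, obs_feat, pixel_goal, planner_latent):
        x = torch.cat([obs_feat, pixel_goal, planner_latent], dim=-1)
        return self.net(x).view(-1, self.horizon, 2)

planner, policy = SlowPlanner(), FastPolicy()
replan_every = 10  # assumed: System 2 runs an order of magnitude slower
goal = latent = None
for step in range(30):
    obs_feat = torch.randn(1, 512)           # placeholder visual features
    if step % replan_every == 0:             # "ground slow"
        goal, latent = planner(obs_feat)
    actions = policy(obs_feat, goal, latent) # "move fast", every step
```

The point of the sketch is the rate decoupling: System 1 keeps producing trajectories from the most recent goal and latent while System 2 deliberates.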
Hyper-GoalNet: Goal-Conditioned Manipulation Policy Learning with HyperNetworks
Zhou, Pei, Yao, Wanting, Luo, Qian, Zhou, Xunzhe, Yang, Yanchao
Goal-conditioned policy learning for robotic manipulation presents significant challenges in maintaining performance across diverse objectives and environments. We introduce Hyper-GoalNet, a framework that generates task-specific policy network parameters from goal specifications using hypernetworks. Unlike conventional methods that simply condition fixed networks on goal-state pairs, our approach separates goal interpretation from state processing -- the former determines network parameters while the latter applies these parameters to current observations. To enhance representation quality for effective policy generation, we implement two complementary constraints on the latent space: (1) a forward dynamics model that promotes state transition predictability, and (2) a distance-based constraint ensuring monotonic progression toward goal states. We evaluate our method on a comprehensive suite of manipulation tasks with varying environmental randomization. Results demonstrate significant performance improvements over state-of-the-art methods, particularly in high-variability conditions. Real-world robotic experiments further validate our method's robustness to sensor noise and physical uncertainties. Code is available at: https://github.com/wantingyao/hyper-goalnet.
- Education (0.67)
- Health & Medicine > Therapeutic Area (0.46)
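A minimal sketch of the separation the Hyper-GoalNet abstract describes: the goal determines the policy network's parameters, and the state is only pushed through them. Dimensions and names are illustrative assumptions; the paper's two latent-space constraints (forward-dynamics predictability and monotonic goal progress) are omitted here.

```python
# Hedged sketch: a hypernetwork maps the goal to the weights of a small
# policy MLP, which is then applied to the current state.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GoalHyperPolicy(nn.Module):
    def __init__(self, goal_dim=64, state_dim=32, hidden=64, action_dim=7):
        super().__init__()
        self.h, self.s, self.a = hidden, state_dim, action_dim
        n_params = hidden * state_dim + hidden + action_dim * hidden + action_dim
        self.hyper = nn.Sequential(  # goal -> flat parameter vector
            nn.Linear(goal_dim, 256), nn.ReLU(), nn.Linear(256, n_params))

    def forward(self, goal, state):
        p = self.hyper(goal)
        i = 0
        w1 = p[i:i + self.h * self.s].view(self.h, self.s); i += self.h * self.s
        b1 = p[i:i + self.h]; i += self.h
        w2 = p[i:i + self.a * self.h].view(self.a, self.h); i += self.a * self.h
        b2 = p[i:]
        x = F.relu(F.linear(state, w1, b1))  # apply the generated parameters
        return F.linear(x, w2, b2)           # action prediction

policy = GoalHyperPolicy()
action = policy(torch.randn(64), torch.randn(32))
```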
MonoDream: Monocular Vision-Language Navigation with Panoramic Dreaming
Wang, Shuo, Wang, Yongcai, Fan, Zhaoxin, Wang, Yucheng, Chen, Maiyue, Wang, Kaihui, Su, Zhizhong, Li, Wanting, Cai, Xudong, Jin, Yeying, Li, Deying
Vision-Language Navigation (VLN) tasks often leverage panoramic RGB and depth inputs to provide rich spatial cues for action planning, but these sensors can be costly or less accessible in real-world deployments. Recent approaches based on Vision-Language Action (VLA) models achieve strong results with monocular input, yet they still lag behind methods using panoramic RGB-D information. We present MonoDream, a lightweight VLA framework that enables monocular agents to learn a Unified Navigation Representation (UNR). This shared feature representation jointly aligns navigation-relevant visual semantics (e.g., global layout, depth, and future cues) and language-grounded action intent, enabling more reliable action prediction. MonoDream further introduces Latent Panoramic Dreaming (LPD) tasks to supervise the UNR, which train the model to predict latent features of panoramic RGB and depth observations at both current and future steps based only on monocular input. Experiments on multiple VLN benchmarks show that MonoDream consistently improves monocular navigation performance and significantly narrows the gap with panoramic-based agents.
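The Latent Panoramic Dreaming idea from the MonoDream abstract can be pictured as auxiliary regression heads on the monocular representation. In this hedged sketch, the panoramic RGB/depth target latents are stand-in random tensors; in training they would come from encoders of the panoramic observations, and the LPD loss would be combined with the action objective. Names and sizes are assumptions.

```python
# Hedged sketch of LPD-style supervision: predict panoramic RGB/depth
# latents (current and next step) from monocular features only.
import torch
import torch.nn as nn

unr_dim, pano_dim = 512, 256
mono_encoder = nn.Linear(1024, unr_dim)            # monocular obs -> UNR
heads = nn.ModuleDict({
    "pano_rgb_t":    nn.Linear(unr_dim, pano_dim),
    "pano_depth_t":  nn.Linear(unr_dim, pano_dim),
    "pano_rgb_t1":   nn.Linear(unr_dim, pano_dim), # one step ahead
    "pano_depth_t1": nn.Linear(unr_dim, pano_dim),
})

mono_feat = torch.randn(4, 1024)                   # batch of monocular features
unr = mono_encoder(mono_feat)
# Targets would come from frozen panoramic RGB-D encoders on the
# training data; random tensors stand in for them here.
targets = {k: torch.randn(4, pano_dim) for k in heads}
lpd_loss = sum(nn.functional.mse_loss(h(unr), targets[k])
               for k, h in heads.items())
lpd_loss.backward()  # supervises the UNR jointly with the action loss
```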
WebCoT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought in Reflection, Branching, and Rollback
Hu, Minda, Fang, Tianqing, Zhang, Jianshu, Ma, Junyu, Zhang, Zhisong, Zhou, Jingyan, Zhang, Hongming, Mi, Haitao, Yu, Dong, King, Irwin
Web agents powered by Large Language Models (LLMs) show promise for next-generation AI, but their limited reasoning in uncertain, dynamic web environments hinders robust deployment. In this paper, we identify key reasoning skills essential for effective web agents, i.e., reflection & lookahead, branching, and rollback, and curate trajectory data that exemplifies these abilities by reconstructing the agent's (inference-time) reasoning algorithms into chain-of-thought rationales. We conduct experiments on OpenWebVoyager, an agent self-improvement benchmark, and demonstrate that distilling salient reasoning patterns into the backbone LLM via simple fine-tuning can substantially enhance its performance. Our approach yields significant improvements across multiple benchmarks, including WebVoyager, Mind2web-live, and SimpleQA (web search), highlighting the potential of targeted reasoning skill enhancement for web agents.
- Media (0.46)
- Information Technology (0.46)
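To make the "reconstructing reasoning algorithms into chain-of-thought rationales" step concrete, the sketch below flattens a toy web trajectory containing branching, rollback, and reflection events into a single rationale string that could serve as a fine-tuning target. The trajectory schema and labels are assumptions, not the paper's data format.

```python
# Hedged sketch: linearize search-style agent events into one CoT string.
trajectory = [
    {"kind": "action",   "text": "click('Search')"},
    {"kind": "branch",   "text": "Two candidate links look relevant."},
    {"kind": "action",   "text": "click('Result 2')"},
    {"kind": "rollback", "text": "Page is unrelated; return to results."},
    {"kind": "reflect",  "text": "Result 1 matches the query terms better."},
    {"kind": "action",   "text": "click('Result 1')"},
]

def to_cot(traj):
    """Flatten branch/rollback/reflect events and actions into a rationale."""
    labels = {"branch": "Branching", "rollback": "Rollback",
              "reflect": "Reflection"}
    lines = []
    for step in traj:
        if step["kind"] == "action":
            lines.append(f"Action: {step['text']}")
        else:
            lines.append(f"{labels[step['kind']]}: {step['text']}")
    return "\n".join(lines)

print(to_cot(trajectory))  # becomes a fine-tuning target for the backbone LLM
```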
RSRNav: Reasoning Spatial Relationship for Image-Goal Navigation
Qin, Zheng, Wang, Le, Wang, Yabing, Zhou, Sanping, Hua, Gang, Tang, Wei
Recent image-goal navigation (ImageNav) methods learn a perception-action policy by separately capturing semantic features of the goal and egocentric images, then passing them to a policy network. However, challenges remain: (1) semantic features often fail to provide accurate directional information, leading to superfluous actions, and (2) performance drops significantly when viewpoint inconsistencies arise between training and application. To address these challenges, we propose RSRNav, a simple yet effective method that reasons about the spatial relationships between the goal and current observations and uses them as navigation guidance. Specifically, we model the spatial relationship by constructing correlations between the goal and current observations, which are then passed to the policy network for action prediction. These correlations are progressively refined using fine-grained cross-correlation and direction-aware correlation for more precise navigation. Extensive evaluation of RSRNav on three benchmark datasets demonstrates superior navigation performance, particularly in the "user-matched goal" setting, highlighting its potential for real-world applications.
- Education (0.46)
- Health & Medicine (0.46)
- Information Technology > Artificial Intelligence > Robots (0.94)
- Information Technology > Artificial Intelligence > Vision (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.35)
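A hedged sketch of the correlation-as-guidance idea in the RSRNav abstract: instead of handing the policy two independent semantic embeddings, build a cross-correlation volume between the goal and observation feature maps. The shapes, normalization, and max-pooled "direction" summary below are illustrative assumptions, not the paper's exact operators.

```python
# Hedged sketch: goal/observation cross-correlation as policy input.
import torch

B, C, H, W = 2, 64, 16, 16
goal_feat = torch.randn(B, C, H, W)   # encoded goal image
obs_feat = torch.randn(B, C, H, W)    # encoded egocentric observation

# Fine-grained cross-correlation: every goal location against every
# observation location, scaled by the feature dimension.
corr = torch.einsum("bchw,bcuv->bhwuv", goal_feat, obs_feat) / C ** 0.5
corr_volume = corr.view(B, H * W, H, W)  # policy-network input

# A direction-aware summary: pool over goal locations so each
# observation cell keeps its best match score.
direction_map = corr_volume.max(dim=1).values  # (B, H, W)
print(corr_volume.shape, direction_map.shape)
```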
Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
Zhu, Chuning, Yu, Raymond, Feng, Siyuan, Burchfiel, Benjamin, Shah, Paarth, Gupta, Abhishek
Imitation learning has emerged as a promising approach towards building generalist robots. However, scaling imitation learning for large robot foundation models remains challenging due to its reliance on high-quality expert demonstrations. Meanwhile, large amounts of video data depicting a wide range of environments and diverse behaviors are readily available. This data provides a rich source of information about real-world dynamics and agent-environment interactions. Leveraging this data directly for imitation learning, however, has proven difficult due to the lack of action annotation. In this work, we present Unified World Models (UWM), a framework that allows for leveraging both video and action data for policy learning. Specifically, a UWM integrates an action diffusion process and a video diffusion process within a unified transformer architecture, where independent diffusion timesteps govern each modality. By controlling each diffusion timestep, UWM can flexibly represent a policy, a forward dynamics model, an inverse dynamics model, and a video generator. Through simulated and real-world experiments, we show that: (1) UWM enables effective pretraining on large-scale multitask robot datasets with both dynamics and action predictions, resulting in more generalizable and robust policies than imitation learning, and (2) UWM naturally facilitates learning from action-free video data through independent control of modality-specific diffusion timesteps, further improving the performance of finetuned policies. Our results suggest that UWM offers a promising step toward harnessing large, heterogeneous datasets for scalable robot learning, and provides a simple unification between the often disparate paradigms of imitation learning and world modeling. Videos and code are available at https://weirdlabuw.github.io/uwm/
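The abstract's central mechanism, independent diffusion timesteps per modality, can be sketched as follows: the pair (t_video, t_action) selects which conditional the shared denoiser represents. The tiny denoiser and the exact mode-to-timestep mapping below are assumptions for illustration; the paper defines the actual architecture and modes.

```python
# Hedged sketch: one joint denoiser, two independent diffusion timesteps.
import torch
import torch.nn as nn

T = 1000  # diffusion steps; t=0 is treated as "clean / fully conditioned"

class JointDenoiser(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Linear(2 * dim + 2, 2 * dim)

    def forward(self, video, action, t_video, t_action):
        t = torch.stack([t_video, t_action]).float().expand(video.shape[0], 2) / T
        out = self.net(torch.cat([video, action, t], dim=-1))
        return out.chunk(2, dim=-1)  # denoised (video, action)

model = JointDenoiser()
video, action = torch.randn(4, 32), torch.randn(4, 32)
modes = {  # assumed mapping, per the abstract's description
    "policy":            (0, T),  # clean video context, denoise actions
    "inverse_dynamics":  (0, T),  # same pattern, conditioned on future frames
    "forward_dynamics":  (T, 0),  # clean actions, denoise future video
    "action_free_video": (T, T),  # denoise both; usable on video-only data
}
for name, (tv, ta) in modes.items():
    v, a = model(video, action, torch.tensor(tv), torch.tensor(ta))
```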
Zero-Shot Iterative Formalization and Planning in Partially Observable Environments
Gong, Liancheng, Zhu, Wang, Thomason, Jesse, Zhang, Li
Using LLMs not to predict plans but to formalize an environment into the Planning Domain Definition Language (PDDL) has been shown to improve performance and control. Existing work focuses on fully observable environments; we tackle the more realistic and challenging partially observable environments that lack complete, reliable information. We propose PDDLego+, a framework to iteratively formalize, plan, grow, and refine PDDL representations in a zero-shot manner, without needing access to any existing trajectories. On two textual simulated environments, we show that PDDLego+ improves goal-reaching success and exhibits robustness against problem complexity. We also show that the domain knowledge captured after a successful trial can benefit future tasks.
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
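A schematic rendering of the PDDLego+ loop as the abstract describes it, with every helper stubbed out: the LLM formalizes (and later grows or refines) a PDDL domain from partial observations, a classical planner is called on it, and execution feedback drives the next iteration. All function names and strings are placeholders, not the authors' implementation.

```python
# Hedged sketch of an iterative formalize-plan-act loop under
# partial observability; every helper is an illustrative stub.
def llm_formalize(observation, domain_so_far):
    """Ask an LLM to emit/extend PDDL for what has been observed."""
    return domain_so_far + f"\n; facts from: {observation}"

def classical_plan(domain_pddl):
    """Stand-in for an off-the-shelf PDDL planner."""
    return ["(move room1 room2)"]

def execute(action):
    """Stand-in for acting in the textual environment."""
    return {"observation": f"after {action}", "goal_reached": False}

domain, obs = "(define (domain textworld))", "initial room description"
for _ in range(5):                       # zero-shot: no prior trajectories
    domain = llm_formalize(obs, domain)  # formalize / grow / refine
    plan = classical_plan(domain)
    if not plan:                         # unsolvable: refine and retry
        continue
    result = execute(plan[0])
    if result["goal_reached"]:
        break
    obs = result["observation"]          # partial observability: new facts
```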
IDEA: Enhancing the Rule Learning Ability of Language Agents through Induction, Deduction, and Abduction
While large language models (LLMs) have been thoroughly evaluated for deductive and inductive reasoning, their proficiency in abductive reasoning and holistic rule learning in interactive environments remains less explored. This work introduces RULEARN, a novel benchmark specifically designed to assess the rule-learning ability of LLMs in interactive settings. In RULEARN, agents interact with the environment to gather observations and discern patterns, using these insights to solve problems. To further enhance the rule-learning capabilities of LLM agents within this benchmark, we propose the IDEA agent, which integrates Induction, Deduction, and Abduction processes. The IDEA agent operationalizes these processes as a structured reasoning sequence: generating hypotheses through abduction, testing them via deduction, and refining them based on feedback from induction. This sequence enables agents to dynamically establish and apply rules, mimicking human-like reasoning processes. Our evaluation of five representative LLMs indicates that while these models can generate plausible initial hypotheses, they often struggle with strategic interaction within the environment, effective incorporation of feedback, and adaptive refinement of their hypotheses. The IDEA agent demonstrates significantly improved performance on the RULEARN benchmark, offering valuable insights for the development of agents capable of human-like rule-learning in real-world scenarios. We will release our code and data.
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Association Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.54)
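The abduction-deduction-induction cycle the abstract describes can be summarized as a small loop; the stubs below are purely illustrative and do not reflect the RULEARN interface.

```python
# Hedged sketch: abduce a hypothesis, deduce a test, induce a refinement.
def abduce(observations):
    return "rule: blue objects are heavy"        # candidate explanation

def deduce(hypothesis):
    return "weigh the blue cube"                 # experiment implied by rule

def environment_step(experiment):
    return {"consistent": False, "note": "blue cube was light"}

def induce(hypothesis, feedback):
    return hypothesis + " except cubes"          # generalize from feedback

observations, hypothesis = ["blue ball was heavy"], None
for _ in range(3):
    hypothesis = abduce(observations) if hypothesis is None else hypothesis
    feedback = environment_step(deduce(hypothesis))
    if feedback["consistent"]:
        break
    observations.append(feedback["note"])
    hypothesis = induce(hypothesis, feedback)    # refine and retry
```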
EPD: Long-term Memory Extraction, Context-awared Planning and Multi-iteration Decision @ EgoPlan Challenge ICML 2024
Shi, Letian, Lv, Qi, Deng, Xiang, Nie, Liqiang
In this technical report, we present our solution for the EgoPlan Challenge at ICML 2024. To address the real-world egocentric task planning problem, we introduce EPD, a novel planning framework comprising three stages: long-term memory Extraction, context-aware Planning, and multi-iteration Decision. Given the task goal, task progress, and current observation, the extraction model first extracts task-relevant memory information from the progress video, transforming the complex long video into summarized memory information. The planning model then combines the context of the memory information with fine-grained visual information from the current observation to predict the next action. Finally, through multi-iteration decision-making, the decision model comprehensively understands the task situation and current state to make the most realistic planning decision. On the EgoPlan-Test set, EPD achieves a planning accuracy of 53.85% over 1,584 egocentric task planning questions. All code is available at https://github.com/Kkskkkskr/EPD.
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.57)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.35)
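A hedged sketch of the three EPD stages in the order the abstract gives them: Extraction compresses the progress video into memory, Planning scores candidate next actions against memory plus the current observation, and the Decision stage queries the planner several times and aggregates. All stubs, names, and the voting scheme are illustrative assumptions, not the released code.

```python
# Hedged sketch of an Extraction -> Planning -> Decision pipeline.
from collections import Counter

def extract_memory(progress_video):
    """Stage 1: compress the long progress video into task-relevant memory."""
    return [f"summary of clip {i}" for i in range(len(progress_video))]

def plan_next_action(goal, memory, current_obs, candidates):
    """Stage 2: score candidate actions against memory + observation."""
    return {c: float(len(c)) for c in candidates}  # dummy scores

def decide(goal, memory, obs, candidates, iterations=3):
    """Stage 3: multi-iteration decision; here, repeated queries + voting."""
    votes = Counter()
    for _ in range(iterations):
        scores = plan_next_action(goal, memory, obs, candidates)
        votes[max(scores, key=scores.get)] += 1
    return votes.most_common(1)[0][0]

video = ["clip0", "clip1", "clip2"]
memory = extract_memory(video)
print(decide("make tea", memory, "kettle in view",
             ["pour water", "open lid", "press switch"]))
```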