navigator
DocLens : A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding
Zhu, Dawei, Meng, Rui, Chen, Jiefeng, Li, Sujian, Pfister, Tomas, Yoon, Jinsung
Comprehending long visual documents, where information is distributed across extensive pages of text and visual elements, is a critical but challenging task for modern Vision-Language Models (VLMs). Existing approaches falter on a fundamental challenge: evidence localization. They struggle to retrieve relevant pages and overlook fine-grained details within visual elements, leading to limited performance and model hallucination. To address this, we propose DocLens, a tool-augmented multi-agent framework that effectively "zooms in" on evidence like a lens. It first navigates from the full document to specific visual elements on relevant pages, then employs a sampling-adjudication mechanism to generate a single, reliable answer. Paired with Gemini-2.5-Pro, DocLens achieves state-of-the-art performance on MMLongBench-Doc and FinRAGBench-V, surpassing even human experts. The framework's superiority is particularly evident on vision-centric and unanswerable queries, demonstrating the power of its enhanced localization capabilities.
- North America > United States > Texas > Schleicher County (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
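To make the sampling-adjudication step concrete, here is a minimal Python sketch. `call_vlm` is a hypothetical stand-in for a real VLM call (the paper pairs DocLens with Gemini-2.5-Pro), and majority voting is an assumed adjudication rule; the paper's actual navigation tools and adjudicator are more involved.

```python
# Minimal sampling-adjudication sketch in the spirit of DocLens.
# `call_vlm` is a hypothetical stand-in for a real VLM call.
from collections import Counter
from typing import Callable

def sampling_adjudication(
    question: str,
    evidence_pages: list[str],
    call_vlm: Callable[[str], str],
    num_samples: int = 5,
) -> str:
    """Sample several candidate answers over localized evidence, then
    adjudicate to a single answer (majority vote here; the paper may
    use a model-based adjudicator instead)."""
    context = "\n".join(evidence_pages)
    prompt = f"Evidence:\n{context}\n\nQuestion: {question}\nAnswer:"
    candidates = [call_vlm(prompt) for _ in range(num_samples)]
    # Majority vote; ties fall back to the first sampled answer.
    answer, _count = Counter(candidates).most_common(1)[0]
    return answer

# Stub usage with a model that always gives the same answer:
if __name__ == "__main__":
    stub = lambda prompt: "42"
    print(sampling_adjudication("What is the reported total?", ["page 3 text"], stub))
```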
Designing Beyond Language: Sociotechnical Barriers in AI Health Technologies for Limited English Proficiency
Huang, Michelle, Rodriguez, Violeta J., Saha, Koustuv, August, Tal
Limited English proficiency (LEP) patients in the U.S. face systemic barriers to healthcare beyond language and interpreter access, encompassing procedural and institutional constraints. AI advances may support communication and care through on-demand translation and visit preparation, but also risk exacerbating existing inequalities. We conducted storyboard-driven interviews with 14 patient navigators to explore how AI could shape care experiences for Spanish-speaking LEP individuals. We identified tensions around linguistic and cultural misunderstandings, privacy concerns, and opportunities and risks for AI to augment care workflows. Participants highlighted structural factors that can undermine trust in AI systems, including sensitive information disclosure, unstable technology access, and low digital literacy. While AI tools can potentially alleviate social barriers and institutional constraints, there are risks of misinformation and uprooting human camaraderie. Our findings contribute design considerations for AI that support LEP patients and care teams via rapport-building, education, and language support, while minimizing disruptions to existing practices.
- North America > United States > Illinois > Champaign County > Urbana (0.14)
- North America > Guatemala (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (16 more...)
- Research Report > New Finding (1.00)
- Questionnaire & Opinion Survey (1.00)
Surfer 2: The Next Generation of Cross-Platform Computer Use Agents
Andreux, Mathieu, Bakler, Märt, Barbier, Yanael, Benchekroun, Hamza, Biré, Emilien, Bonnet, Antoine, Bordie, Riaz, Bout, Nathan, Brunel, Matthias, Cambray, Aleix, Cedoz, Pierre-Louis, Chassang, Antoine, Cloix, Gautier, Connelly, Ethan, Constantinou, Alexandra, De Coster, Ramzi, de la Jonquiere, Hubert, Delfosse, Aurélien, Delpit, Maxime, Deprez, Alexis, Derupti, Augustin, Diaz, Mathieu, D'Souza, Shannon, Dujardin, Julie, Edmund, Abai, Eickenberg, Michael, Fatalot, Armand, Felissi, Wissem, Herring, Isaac, Koegler, Xavier, de Kergaradec, Erwan Le Jumeau, Lac, Aurélien, Langevin, Maxime, Lauverjat, Corentin, Loison, Antonio, Manevich, Avshalom, Moyal, Axel, Kerbel, Axel Nguyen, Parovic, Marinela, Revelle, Julien, Richard, Guillaume, Richter, Mats, Riochet, Ronan, Santos, María, Savidan, Romain, Sifre, Laurent, Theillard, Maxime, Thibault, Marc, Valentini, Ivan, Wu, Tony, Yie, Laura, Yuan, Kai, Zubovskij, Jevgenij
Building agents that generalize across web, desktop, and mobile environments remains an open challenge, as prior systems rely on environment-specific interfaces that limit cross-platform deployment. We introduce Surfer 2, a unified architecture operating purely from visual observations that achieves state-of-the-art performance across all three environments. Surfer 2 integrates hierarchical context management, decoupled planning and execution, and self-verification with adaptive recovery, enabling reliable operation over long task horizons. Our system achieves 97.1% accuracy on WebVoyager, 69.6% on WebArena, 60.1% on OSWorld, and 87.1% on AndroidWorld, outperforming all prior systems without task-specific fine-tuning. With multiple attempts, Surfer 2 exceeds human performance on all benchmarks. These results demonstrate that systematic orchestration amplifies foundation model capabilities and enables general-purpose computer control through visual interaction alone, while calling for a next-generation vision language model to achieve Pareto-optimal cost-efficiency.
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Communications > Mobile (0.95)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
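The abstract describes decoupled planning and execution with self-verification and adaptive recovery. The sketch below shows one way such a loop could be wired together; `plan`, `execute`, and `verify` are hypothetical callables standing in for Surfer 2's vision-based components, and the flat history list is a simplification of its hierarchical context management.

```python
# Hedged sketch of a decoupled plan/execute/verify loop with adaptive replanning.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class AgentLoop:
    plan: Callable[[str, list[str]], list[str]]       # task, history -> subgoals
    execute: Callable[[str], str]                     # subgoal -> observation
    verify: Callable[[str, str], bool]                # subgoal, observation -> ok?
    max_recoveries: int = 3
    history: list[str] = field(default_factory=list)  # flattened context

    def run(self, task: str) -> bool:
        pending = list(self.plan(task, self.history))
        recoveries = 0
        while pending:
            goal = pending[0]
            obs = self.execute(goal)
            self.history.append(f"{goal} -> {obs}")
            if self.verify(goal, obs):
                pending.pop(0)  # subgoal verified, move on
                continue
            recoveries += 1
            if recoveries > self.max_recoveries:
                return False    # recovery budget exhausted
            pending = list(self.plan(task, self.history))  # adaptive replan
        return True

# Stub usage: a planner that emits one subgoal and always verifies.
loop = AgentLoop(
    plan=lambda task, hist: [f"do:{task}"],
    execute=lambda goal: "ok",
    verify=lambda goal, obs: obs == "ok",
)
print(loop.run("open settings"))  # True
```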
SWIRL: A Staged Workflow for Interleaved Reinforcement Learning in Mobile GUI Control
Lu, Quanfeng, Ma, Zhantao, Zhong, Shuai, Wang, Jin, Yu, Dahai, Ng, Michael K., Luo, Ping
The rapid advancement of large vision language models (LVLMs) and agent systems has heightened interest in mobile GUI agents that can reliably translate natural language into interface operations. Existing single-agent approaches, however, remain limited by structural constraints. Although multi-agent systems naturally decouple different competencies, recent progress in multi-agent reinforcement learning (MARL) has often been hindered by inefficiency and remains incompatible with current LVLM architectures. To address these challenges, we introduce SWIRL, a staged workflow for interleaved reinforcement learning designed for multi-agent systems. SWIRL reformulates MARL into a sequence of single-agent reinforcement learning tasks, updating one agent at a time while keeping the others fixed. This formulation enables stable training and promotes efficient coordination across agents. Theoretically, we provide a stepwise safety bound, a cross-round monotonic improvement theorem, and convergence guarantees on return, ensuring robust and principled optimization. In application to mobile GUI control, SWIRL instantiates a Navigator that converts language and screen context into structured plans, and an Interactor that grounds these plans into executable atomic actions. Extensive experiments demonstrate superior performance on both high-level and low-level GUI benchmarks. Beyond GUI tasks, SWIRL also demonstrates strong capability in multi-agent mathematical reasoning, underscoring its potential as a general framework for developing efficient and robust multi-agent systems.
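A minimal sketch of SWIRL's staged schedule: multi-agent RL reformulated as a sequence of single-agent updates, training one agent per stage while the others stay frozen. `rl_update` is a hypothetical stand-in for a single-agent RL step (e.g., one policy-gradient epoch); the real Navigator/Interactor training is not reproduced here.

```python
# Staged, interleaved update schedule: one trainable agent per stage.
from typing import Any, Callable

def swirl_train(
    agents: dict[str, Any],                 # e.g., {"navigator": ..., "interactor": ...}
    rl_update: Callable[[str, dict[str, Any]], Any],
    rounds: int = 4,
) -> dict[str, Any]:
    for _ in range(rounds):
        for name in agents:                 # stage: exactly one agent is trainable
            frozen = {k: v for k, v in agents.items() if k != name}
            # The update sees the frozen co-agents so rollouts reflect the
            # current joint behavior, but only `name`'s parameters change.
            agents[name] = rl_update(name, {**frozen, name: agents[name]})
    return agents

# Stub usage: "training" just increments a counter standing in for parameters.
agents = {"navigator": 0, "interactor": 0}
print(swirl_train(agents, rl_update=lambda name, ctx: ctx[name] + 1, rounds=2))
# {'navigator': 2, 'interactor': 2}
```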
BadNAVer: Exploring Jailbreak Attacks On Vision-and-Language Navigation
Lyu, Wenqi, Li, Zerui, Qiao, Yanyuan, Wu, Qi
Multimodal large language models (MLLMs) have recently gained attention for their generalization and reasoning capabilities in Vision-and-Language Navigation (VLN) tasks, leading to the rise of MLLM-driven navigators. However, MLLMs are vulnerable to jailbreak attacks, where crafted prompts bypass safety mechanisms and trigger undesired outputs. In embodied scenarios, such vulnerabilities pose greater risks: unlike plain text models that generate toxic content, embodied agents may interpret malicious instructions as executable commands, potentially leading to real-world harm. In this paper, we present the first systematic jailbreak attack paradigm targeting MLLM-driven navigators. We propose a three-tiered attack framework and construct malicious queries across four intent categories, concatenated with standard navigation instructions. In the Matterport3D simulator, we evaluate navigation agents powered by five MLLMs and report an average attack success rate of over 90%. To test real-world feasibility, we replicate the attack on a physical robot. Our results show that even well-crafted prompts can induce harmful actions and intents in MLLMs, posing risks beyond toxic output and potentially leading to physical harm.
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)
SmartWay: Enhanced Waypoint Prediction and Backtracking for Zero-Shot Vision-and-Language Navigation
Shi, Xiangyu, Li, Zerui, Lyu, Wenqi, Xia, Jiatong, Dayoub, Feras, Qiao, Yanyuan, Wu, Qi
Vision-and-Language Navigation (VLN) in continuous environments requires agents to interpret natural language instructions while navigating unconstrained 3D spaces. Existing VLN-CE frameworks rely on a two-stage approach: a waypoint predictor to generate waypoints and a navigator to execute movements. However, current waypoint predictors struggle with spatial awareness, while navigators lack historical reasoning and backtracking capabilities, limiting adaptability. We propose a zero-shot VLN-CE framework integrating an enhanced waypoint predictor with a Multi-modal Large Language Model (MLLM)-based navigator. Our predictor employs a stronger vision encoder, masked cross-attention fusion, and an occupancy-aware loss for better waypoint quality. The navigator incorporates history-aware reasoning and adaptive path planning with backtracking, improving robustness. Experiments on R2R-CE and MP3D benchmarks show our method achieves state-of-the-art (SOTA) performance in zero-shot settings, demonstrating competitive results compared to fully supervised methods. Real-world validation on Turtlebot 4 further highlights its adaptability.
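The two-stage loop from the abstract, as a hedged sketch: a waypoint predictor proposes candidates and an MLLM-based navigator either picks one, signals a backtrack, or stops. `predict_waypoints` and `choose` are hypothetical stand-ins for the paper's models; the backtracking-by-stack behavior is an illustrative assumption.

```python
# Two-stage VLN-CE loop: waypoint prediction + navigator decision with backtracking.
from typing import Callable, Optional

BACKTRACK = "<backtrack>"

def navigate(
    instruction: str,
    start: tuple[float, float],
    predict_waypoints: Callable[[tuple[float, float]], list[tuple[float, float]]],
    choose: Callable[[str, list, list], Optional[object]],  # instruction, history, candidates
    max_steps: int = 20,
) -> list[tuple[float, float]]:
    path = [start]
    for _ in range(max_steps):
        candidates = predict_waypoints(path[-1])
        decision = choose(instruction, path, candidates)
        if decision == BACKTRACK and len(path) > 1:
            path.pop()    # return to the previous waypoint and retry from there
            continue
        if decision is None:
            break         # navigator declares the goal reached
        path.append(decision)
    return path
```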
LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning
Yang, Hanqing, Chen, Jingdi, Siew, Marie, Lorido-Botran, Tania, Joe-Wong, Carlee
Developing intelligent agents for long-term cooperation in dynamic open-world scenarios is a major challenge in multi-agent systems. Traditional Multi-agent Reinforcement Learning (MARL) frameworks like centralized training with decentralized execution (CTDE) struggle with scalability and flexibility. They require centralized long-term planning, which is difficult without custom reward functions, and face challenges in processing multi-modal data. CTDE approaches also assume fixed cooperation strategies, making them impractical in dynamic environments where agents need to adapt and plan independently. To address decentralized multi-agent cooperation, we propose Decentralized Adaptive Knowledge Graph Memory and Structured Communication System (DAMCS) in a novel Multi-agent Crafter environment. Our generative agents, powered by Large Language Models (LLMs), are more scalable than traditional MARL agents by leveraging external knowledge and language for long-term planning and reasoning. Instead of fully sharing information from all past experiences, DAMCS introduces a multi-modal memory system organized as a hierarchical knowledge graph and a structured communication protocol to optimize agent cooperation. This allows agents to reason from past interactions and share relevant information efficiently. Experiments on novel multi-agent open-world tasks show that DAMCS outperforms both MARL and LLM baselines in task efficiency and collaboration. Compared to single-agent scenarios, the two-agent scenario achieves the same goal with 63% fewer steps, and the six-agent scenario with 74% fewer steps, highlighting the importance of adaptive memory and structured communication in achieving long-term goals. We publicly release our project at: https://happyeureka.github.io/damcs.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Asia > Singapore (0.04)
- Leisure & Entertainment > Games > Computer Games (0.67)
- Materials > Metals & Mining > Diamonds (0.45)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
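A toy sketch of a hierarchical knowledge-graph memory with selective sharing, in the spirit of DAMCS. The goal-to-fact schema and the message format below are illustrative assumptions, not the paper's exact protocol.

```python
# Hierarchical KG memory: agents share task-relevant facts, not full experience.
from collections import defaultdict

class HierarchicalKGMemory:
    def __init__(self) -> None:
        # parent node -> set of child facts, e.g. "collect_wood" -> {"axe_at(3,4)"}
        self.graph: dict[str, set[str]] = defaultdict(set)

    def record(self, parent: str, fact: str) -> None:
        self.graph[parent].add(fact)

    def relevant(self, goal: str) -> set[str]:
        """Return only facts attached to the queried goal, so an agent shares
        task-relevant knowledge rather than its entire history."""
        return set(self.graph.get(goal, set()))

def structured_message(sender: str, goal: str, memory: HierarchicalKGMemory) -> dict:
    # A minimal structured-communication payload (assumed format).
    return {"from": sender, "goal": goal, "facts": sorted(memory.relevant(goal))}

# Usage: agent A shares only what matters for "collect_wood".
mem = HierarchicalKGMemory()
mem.record("collect_wood", "axe_at(3,4)")
mem.record("find_water", "river_at(9,1)")
print(structured_message("agent_A", "collect_wood", mem))
# {'from': 'agent_A', 'goal': 'collect_wood', 'facts': ['axe_at(3,4)']}
```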
Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel
Wang, Zun, Li, Jialu, Hong, Yicong, Li, Songze, Li, Kunchang, Yu, Shoubin, Wang, Yi, Qiao, Yu, Wang, Yali, Bansal, Mohit, Wang, Limin
Creating high-quality data for training robust language-instructed agents is a long-lasting challenge in embodied AI. In this paper, we introduce a Self-Refining Data Flywheel (SRDF) that generates high-quality and large-scale navigational instruction-trajectory pairs by iteratively refining the data pool through the collaboration between two models, the instruction generator and the navigator, without any human-in-the-loop annotation. Specifically, SRDF starts by using a base generator to create an initial data pool for training a base navigator, followed by applying the trained navigator to filter the data pool. This leads to higher-fidelity data to train a better generator, which can, in turn, produce higher-quality data for training the next-round navigator. Such a flywheel establishes a data self-refining process, yielding a continuously improved and highly effective dataset for large-scale language-guided navigation learning. Our experiments demonstrate that after several flywheel rounds, the navigator elevates the performance boundary from 70% to 78% SPL on the classic R2R test set, surpassing human performance (76%) for the first time. Meanwhile, this process results in a superior generator, evidenced by a SPICE increase from 23.5 to 26.2, better than all previous VLN instruction generation methods. Finally, we demonstrate the scalability of our method through increasing environment and instruction diversity, and the generalization ability of our pre-trained navigator across various downstream navigation tasks, surpassing state-of-the-art methods by a large margin in all cases.
[Figure 1: (a) Our pipeline: after using the (instruction) generator to label paths for data augmentation in navigator training, we leverage the trained navigator to filter high-quality data to train a better generator, and the improved generator refines the data pool to train a stronger navigator, iteratively running on the flywheel. It also surpasses human performance on R2R and approaches human-level results on RxR-English and CVDN (for other tasks, human performance is not reported in their paper). The R2R result is from the test set, while others are from val unseen.]
The lack of high-quality data is one of the main bottlenecks in training embodied agents to complete real-world human activities. Unlike many other discriminative or generative learning problems, where the data itself naturally formulates a self-supervised learning objective (Devlin, 2018; He et al., 2022) or the data labeling can be facilitated by existing models (Ros et al., 2016; Tian et al., 2024), training embodied agents usually requires expensive human annotation on complex vision-linguistic contents and physical interactions.
- Asia > China > Shanghai > Shanghai (0.04)
- North America > Dominican Republic (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- (2 more...)
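The flywheel itself reduces to a simple alternation, sketched below. `train_navigator`, `train_generator`, and `fidelity` are hypothetical stand-ins for the paper's models and its navigator-based filtering; generating fresh pairs with the improved generator is elided.

```python
# Skeleton of a self-refining data flywheel: train, filter, retrain, repeat.
from typing import Any, Callable

def srdf_flywheel(
    seed_pairs: list[tuple[str, list]],                  # (instruction, trajectory)
    train_navigator: Callable[[list], Any],
    train_generator: Callable[[list], Any],
    fidelity: Callable[[Any, tuple], float],             # navigator-scored quality
    rounds: int = 3,
    threshold: float = 0.7,
):
    pool = list(seed_pairs)
    generator = navigator = None
    for _ in range(rounds):
        navigator = train_navigator(pool)                # train on current pool
        # Navigator filters the pool: keep only pairs it can follow faithfully.
        pool = [p for p in pool if fidelity(navigator, p) >= threshold]
        generator = train_generator(pool)                # better generator...
        # ...would relabel/extend the pool for the next round (elided: sampling
        # new trajectories and instructions with `generator`).
    return generator, navigator, pool
```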
To Ask or Not to Ask? Detecting Absence of Information in Vision and Language Navigation
Abraham, Savitha Sam, Garg, Sourav, Dayoub, Feras
Recent research in Vision Language Navigation (VLN) has overlooked the development of agents' inquisitive abilities, which allow them to ask clarifying questions when instructions are incomplete. This paper addresses how agents can recognize "when" they lack sufficient information, without focusing on "what" is missing, particularly in VLN tasks with vague instructions. Equipping agents with this ability enhances efficiency by reducing potential digressions and seeking timely assistance. The challenge in identifying such uncertain points is balancing between being overly cautious (high recall) and overly confident (high precision). We propose an attention-based instruction-vagueness estimation module that learns associations between instructions and the agent's trajectory. By leveraging instruction-to-path alignment information during training, the module's vagueness estimation performance improves by around 52% in terms of precision-recall balance. In our ablative experiments, we also demonstrate the effectiveness of incorporating this additional instruction-to-path attention network alongside the cross-modal attention networks within the navigator module. Our results show that the attention scores from the instruction-to-path attention network serve as better indicators for estimating vagueness.
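As a rough illustration of an instruction-to-path attention module for vagueness estimation: the abstract only states that attention between instruction and trajectory feeds a vagueness prediction, so the single cross-attention layer and linear head below are assumptions, not the paper's architecture.

```python
# Minimal instruction-to-path cross-attention with a scalar vagueness head.
import torch
import torch.nn as nn

class VaguenessEstimator(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, instr: torch.Tensor, path: torch.Tensor) -> torch.Tensor:
        # instr: (B, L_i, D) instruction token features
        # path:  (B, L_p, D) trajectory step features
        attended, _weights = self.cross_attn(query=instr, key=path, value=path)
        # Pool over instruction tokens and score: a high output is read as
        # "vague", i.e. the instruction finds little support in the path.
        return torch.sigmoid(self.head(attended.mean(dim=1))).squeeze(-1)

model = VaguenessEstimator()
score = model(torch.randn(2, 12, 256), torch.randn(2, 8, 256))
print(score.shape)  # torch.Size([2])
```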