Retrieval module


URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding

Shi, Yongxin, Wang, Jiapeng, Shan, Zeyu, Peng, Dezhi, Lin, Zening, Jin, Lianwen

arXiv.org Artificial Intelligence

Recent multimodal large language models (MLLMs) still struggle with long document understanding due to two fundamental challenges: information interference from abundant irrelevant content, and the quadratic computational cost of Transformer-based architectures. Existing approaches primarily fall into two categories: token compression, which sacrifices fine-grained details; and introducing external retrievers, which increase system complexity and prevent end-to-end optimization. To address these issues, we conduct an in-depth analysis and observe that MLLMs exhibit a human-like coarse-to-fine reasoning pattern: early Transformer layers attend broadly across the document, while deeper layers focus on relevant evidence pages. Motivated by this insight, we posit that the inherent evidence localization capabilities of MLLMs can be explicitly leveraged to perform retrieval during the reasoning process, facilitating efficient long document understanding. To this end, we propose URaG, a simple-yet-effective framework that Unifies Retrieval and Generation within a single MLLM. URaG introduces a lightweight cross-modal retrieval module that converts the early Transformer layers into an efficient evidence selector, identifying and preserving the most relevant pages while discarding irrelevant content. This design enables the deeper layers to concentrate computational resources on pertinent information, improving both accuracy and efficiency. Extensive experiments demonstrate that URaG achieves state-of-the-art performance while reducing computational overhead by 44-56%. The code is available at https://github.com/shi-yx/URaG.
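
As a rough illustration of turning early-layer attention into a page selector, here is a minimal sketch: score each page by the attention mass query tokens place on its tokens, then keep the top-k. The `select_evidence_pages` helper, the aggregation rule, and all shapes are assumptions for illustration, not URaG's actual implementation.

```python
# Minimal sketch: keep the pages that receive the most early-layer attention.
# Names, shapes, and the mean-pooling rule are illustrative assumptions.
import torch

def select_evidence_pages(attn, page_ids, keep_k=2):
    """attn: (num_query_tokens, num_doc_tokens) early-layer attention weights;
    page_ids: (num_doc_tokens,) page index of each document token."""
    token_scores = attn.mean(dim=0)               # average over query tokens
    num_pages = int(page_ids.max().item()) + 1
    page_scores = torch.zeros(num_pages)
    page_scores.scatter_add_(0, page_ids, token_scores)
    keep = torch.topk(page_scores, k=min(keep_k, num_pages)).indices
    token_mask = torch.isin(page_ids, keep)       # tokens on retained pages
    return keep, token_mask

# Toy usage: 4 query tokens, 12 document tokens spread over 3 pages.
attn = torch.rand(4, 12).softmax(dim=-1)
page_ids = torch.arange(12) // 4
pages, mask = select_evidence_pages(attn, page_ids, keep_k=2)
```

Only the tokens selected by `mask` would then be passed to the deeper layers, which is what yields the computational savings the abstract reports.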


PISA: A Pragmatic Psych-Inspired Unified Memory System for Enhanced AI Agency

Jia, Shian, Huang, Ziyang, Wang, Xinbo, Zhang, Haofei, Song, Mingli

arXiv.org Artificial Intelligence

Memory systems are fundamental to AI agents, yet existing work often lacks adaptability to diverse tasks and overlooks the constructive and task-oriented role of AI agent memory. Drawing from Piaget's theory of cognitive development, we propose PISA, a pragmatic, psych-inspired unified memory system that addresses these limitations by treating memory as a constructive and adaptive process. To enable continuous learning and adaptability, PISA introduces a trimodal adaptation mechanism (i.e., schema update, schema evolution, and schema creation) that preserves coherent organization while supporting flexible memory updates. Building on these schema-grounded structures, we further design a hybrid memory access architecture that seamlessly integrates symbolic reasoning with neural retrieval, significantly improving retrieval accuracy and efficiency. Our empirical evaluation, conducted on the existing LOCOMO benchmark and our newly proposed AggQA benchmark for data analysis tasks, confirms that PISA sets a new state-of-the-art by significantly enhancing adaptability and long-term knowledge retention.
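
The trimodal adaptation mechanism lends itself to a simple threshold rule: update a schema the observation fits, evolve one it partially fits, otherwise create a new one. Below is a hedged sketch; the `Schema` class, the lexical similarity stand-in, and the thresholds are invented for illustration and are not taken from the paper.

```python
# Illustrative trimodal schema-adaptation rule (update / evolve / create).
# Thresholds and the similarity function are assumptions, not PISA's values.
from dataclasses import dataclass, field

@dataclass
class Schema:
    topic: str
    facts: list = field(default_factory=list)

def similarity(a: str, b: str) -> float:
    """Stand-in lexical overlap; a real system would use embeddings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def adapt(memory: list, observation: str, update_t=0.6, evolve_t=0.3) -> str:
    best = max(memory, key=lambda s: similarity(s.topic, observation),
               default=None)
    score = similarity(best.topic, observation) if best else 0.0
    if best and score >= update_t:     # fits an existing schema: update it
        best.facts.append(observation)
        return "update"
    if best and score >= evolve_t:     # partially fits: evolve the schema
        best.topic = f"{best.topic} / {observation}"
        best.facts.append(observation)
        return "evolve"
    memory.append(Schema(topic=observation, facts=[observation]))
    return "create"                    # no fit: create a new schema
```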


Implicit State Estimation via Video Replanning

Ko, Po-Chen, Mao, Jiayuan, Fu, Yu-Hsiang, Yeh, Hsien-Jeng, Chen, Chu-Rong, Ma, Wei-Chiu, Du, Yilun, Sun, Shao-Hua

arXiv.org Artificial Intelligence

Video-based representations have gained prominence in planning and decision-making due to their ability to encode rich spatiotemporal dynamics and geometric relationships. These representations enable flexible and generalizable solutions for complex tasks such as object manipulation and navigation. However, existing video planning frameworks often struggle to adapt to failures at interaction time due to their inability to reason about uncertainties in partially observed environments. To overcome these limitations, we introduce a novel framework that integrates interaction-time data into the planning process. Our approach updates model parameters online and filters out previously failed plans during generation. This enables implicit state estimation, allowing the system to adapt dynamically without explicitly modeling unknown state variables. We evaluate our framework through extensive experiments on a new simulated manipulation benchmark, demonstrating its ability to improve replanning performance and advance the field of video-based decision-making.

Learning from videos has gained significant traction in decision-making, as videos capture rich visual and dynamic information while aligning with how humans acquire knowledge. These properties make them a powerful medium for specifying tasks and learning diverse skills across contexts. Recent work has shown the effectiveness of video-based frameworks in enabling robots to learn behaviors such as object manipulation (Li et al., 2024) and navigation (Zhang et al., 2024), highlighting the value of video as a flexible and expressive representation. This paper focuses on video as a planning representation. Given a goal and current observation, video planning systems generate imagined task executions and convert them into robot actions. Unlike symbolic or latent representations, videos naturally encode both perceptual and action information and generalize across tasks and environments. Prior works (Chang et al., 2020; Du et al., 2024a;b) leverage these properties to train universal agents using video-based predictions. Despite promising results, existing video planning frameworks suffer from a crucial limitation: they lack mechanisms to integrate past interactions with the environment and cannot effectively reason about uncertainty due to partial observability.
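
One concrete way to realize "filtering out previously failed plans during generation" is rejection sampling against embeddings of failed plans. The sketch below assumes a toy planner interface (`sample_plan`, `score`) and a cosine-similarity threshold, none of which come from the paper.

```python
# Hedged sketch of interaction-time replanning: sample candidate plans and
# reject any that look too similar to previously failed ones. The planner
# below is a toy stand-in; a real system would use a video generation model.
import numpy as np

rng = np.random.default_rng(0)

class ToyVideoPlanner:
    """Stand-in for a learned video planner; a 'plan' is just a vector here."""
    def sample_plan(self, obs, goal):
        return rng.normal(size=8)
    def score(self, plan, goal):
        return -float(np.linalg.norm(plan - goal))

def unit(v):
    return v / (np.linalg.norm(v) + 1e-9)

def replan(model, obs, goal, failed_plans, n_candidates=8, reject_t=0.9):
    failed = [unit(f) for f in failed_plans]
    kept = []
    for _ in range(n_candidates):
        plan = model.sample_plan(obs, goal)
        # Filter: drop plans whose direction matches a past failure too closely.
        if all(float(unit(plan) @ f) < reject_t for f in failed):
            kept.append(plan)
    pool = kept or [model.sample_plan(obs, goal)]   # fallback if all rejected
    return max(pool, key=lambda p: model.score(p, goal))

model = ToyVideoPlanner()
best = replan(model, obs=None, goal=np.ones(8),
              failed_plans=[rng.normal(size=8)])
```

The paper additionally updates model parameters online from interaction data; that gradient step is omitted here for brevity.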


AlphaApollo: Orchestrating Foundation Models and Professional Tools into a Self-Evolving System for Deep Agentic Reasoning

Zhou, Zhanke, Cao, Chentao, Feng, Xiao, Li, Xuan, Li, Zongze, Lu, Xiangyu, Yao, Jiangchao, Huang, Weikai, Xu, Linrui, Cheng, Tian, Jiang, Guanyu, Zheng, Yiming, Miranda, Brando, Liu, Tongliang, Koyejo, Sanmi, Sugiyama, Masashi, Han, Bo

arXiv.org Artificial Intelligence

We present AlphaApollo, a self-evolving agentic reasoning system that aims to address two bottlenecks in foundation model (FM) reasoning: limited model-intrinsic capacity and unreliable test-time iteration. AlphaApollo orchestrates multiple models with professional tools to enable deliberate, verifiable reasoning. It couples (i) a computation tool (Python with numerical and symbolic libraries) and (ii) a retrieval tool (task-relevant external information) to execute exact calculations and ground decisions. The system further supports multi-round, multi-model solution evolution via a shared state map that records candidates, executable checks, and feedback for iterative refinement. In evaluations on AIME 2024/2025 across multiple models, AlphaApollo delivers consistent gains: +5.15% Average@32 and +23.34% Pass@32 for Qwen2.5-14B-Instruct, and +8.91% Average@32 with +26.67% Pass@32 for Llama-3.3-70B-Instruct. Tool-use analysis shows that more than 80% of tool calls are successfully executed, with consistent outperformance of non-tool baselines, thereby lifting the capability ceiling of FMs. More empirical results and implementation details will be updated at https://github.com/tmlr-group/AlphaApollo.
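
The shared state map can be pictured as a small record-keeping structure threading candidates, checks, and feedback between rounds. A minimal sketch follows; the field names (`solver`, `check_code`, `feedback`) are hypothetical, since the abstract does not specify the schema.

```python
# Sketch of a shared state map for multi-round, multi-model solution
# evolution. Field names are assumptions, not AlphaApollo's actual schema.
from dataclasses import dataclass, field

@dataclass
class Candidate:
    solver: str             # which model proposed this solution
    solution: str           # candidate answer or derivation
    check_code: str         # executable verification (e.g., Python/sympy)
    feedback: str = ""      # verifier output used to refine the next round
    passed: bool = False

@dataclass
class StateMap:
    problem: str
    rounds: list = field(default_factory=list)   # list of lists of Candidate

    def record(self, round_idx: int, cand: Candidate) -> None:
        while len(self.rounds) <= round_idx:
            self.rounds.append([])
        self.rounds[round_idx].append(cand)

    def latest_feedback(self) -> str:
        """Concatenate feedback from the latest round to condition refinement."""
        return "\n".join(c.feedback for c in self.rounds[-1]) if self.rounds else ""
```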


BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software

Zhang, Zehua, Bajaj, Ati Priya, Handa, Divij, Liu, Siyu, Raj, Arvind S, Chen, Hongkai, Wang, Hulin, Liu, Yibo, Basque, Zion Leonahenahe, Nath, Souradip, Juneja, Vishal, Chapre, Nikhil, Shoshitaishvili, Yan, Doupé, Adam, Baral, Chitta, Wang, Ruoyu

arXiv.org Artificial Intelligence

Automatically compiling open-source software (OSS) projects is a vital, labor-intensive, and complex task, which makes it a good challenge for LLM agents. Existing methods rely on manually curated rules and workflows, which cannot adapt to OSS that requires customized configuration or environment setup. Recent attempts using Large Language Models (LLMs) have relied on selective evaluation of a subset of highly rated OSS, a practice that underestimates the realistic challenges of OSS compilation. In practice, compilation instructions are often absent, dependencies are undocumented, and successful builds may even require patching source files or modifying build scripts. We propose a more challenging and realistic benchmark, BUILD-BENCH, comprising OSS that are more diverse in quality, scale, and characteristics. Furthermore, we propose a strong baseline LLM-based agent, OSS-BUILD-AGENT, an effective system with an enhanced build-instruction retrieval module that achieves state-of-the-art performance on BUILD-BENCH and is adaptable to heterogeneous OSS characteristics. We also provide a detailed analysis of different compilation method design choices and their influence on the whole task, offering insights to guide future advances. We believe performance on BUILD-BENCH can faithfully reflect an agent's ability to tackle compilation as a complex software engineering task, and, as such, our benchmark will spur innovation with a significant impact on downstream applications in the fields of software development and software security.
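
A build-repair agent of this kind typically loops between attempting a build and consulting retrieved documentation. The skeleton below is a guess at such a loop; `propose_cmd` and `retrieve_docs` stand in for LLM and retrieval calls and are not OSS-BUILD-AGENT's actual interface.

```python
# Hypothetical build-repair loop: attempt a build, and on failure feed the
# error log plus retrieved build instructions back into the next attempt.
import subprocess

def try_build(cmd: list, cwd: str):
    proc = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def build_agent(repo_dir: str, propose_cmd, retrieve_docs, max_attempts=5):
    """propose_cmd(log, docs) -> list[str]: LLM call suggesting the next
    build command; retrieve_docs(log) -> str: fetches build instructions."""
    log, docs = "", ""
    for _ in range(max_attempts):
        cmd = propose_cmd(log, docs)
        ok, log = try_build(cmd, cwd=repo_dir)
        if ok:
            return cmd                    # build succeeded with this command
        docs = retrieve_docs(log)         # ground the next attempt in docs
    return None                           # gave up within the attempt budget
```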


MMAPG: A Training-Free Framework for Multimodal Multi-hop Question Answering via Adaptive Planning Graphs

Hu, Yiheng, Wang, Xiaoyang, Liu, Qing, Xu, Xiwei, Fu, Qian, Zhang, Wenjie, Zhu, Liming

arXiv.org Artificial Intelligence

Multimodal multi-hop question answering requires integrating information from diverse sources, such as images and texts, to derive answers. Existing methods typically rely on sequential retrieval and reasoning, where each step builds on the previous output. However, this single-path paradigm makes them vulnerable to errors due to misleading intermediate steps. Moreover, developing multimodal models can be computationally expensive, often requiring extensive training. To address these limitations, we propose a training-free framework guided by an Adaptive Planning Graph, which consists of planning, retrieval, and reasoning modules. The planning module analyzes the current state of the Adaptive Planning Graph and determines the next action and where to expand the graph, enabling dynamic and flexible exploration of reasoning paths. To handle retrieval from text queries to targets of unspecified modality, we devise modality-specific strategies that dynamically adapt to distinct data types. Our approach preserves the characteristics of multimodal information without costly task-specific training, enabling seamless integration with up-to-date models. Finally, experiments on MultimodalQA and WebQA show that our approach matches or outperforms existing models that rely on training.
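
To make the adaptive planning graph concrete, here is a toy expansion loop; the `Node` fields, the heuristic in `plan_step`, and the retrieve/decompose/answer callbacks are illustrative assumptions, since the paper delegates these decisions to LLM-based modules.

```python
# Toy expansion loop over an adaptive planning graph: inspect an open node,
# choose an action, and possibly grow the graph with sub-questions.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    question: str
    evidence: list = field(default_factory=list)
    children: list = field(default_factory=list)
    answer: Optional[str] = None

def plan_step(node: Node) -> str:
    """Pick the next action (a crude stand-in for the planning module)."""
    if not node.evidence:
        return "retrieve"
    if not node.children and " and " in node.question:
        return "decompose"              # expand the graph with sub-questions
    return "answer"

def run(root: Node, retrieve, decompose, answer, budget=10):
    frontier = [root]
    for _ in range(budget):
        open_nodes = [n for n in frontier if n.answer is None]
        if not open_nodes:
            break
        node = open_nodes[-1]           # expand the newest open node first
        action = plan_step(node)
        if action == "retrieve":
            node.evidence.extend(retrieve(node.question))
        elif action == "decompose":
            node.children = [Node(q) for q in decompose(node.question)]
            frontier.extend(node.children)
        else:
            node.answer = answer(node)  # reasoning module consumes evidence
    return root
```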


HyFedRAG: A Federated Retrieval-Augmented Generation Framework for Heterogeneous and Privacy-Sensitive Data

Qian, Cheng, Zhang, Hainan, Tong, Yongxin, Zheng, Hong-Wei, Zheng, Zhiming

arXiv.org Artificial Intelligence

Centralized RAG pipelines struggle with heterogeneous and privacy-sensitive data, especially in distributed healthcare settings where patient data spans SQL, knowledge graphs, and clinical notes. Clinicians face difficulties retrieving rare disease cases due to privacy constraints and the limitations of traditional cloud-based RAG systems in handling diverse formats and edge devices. To address this, we introduce HyFedRAG, a unified and efficient Federated RAG framework tailored for Hybrid data modalities. By leveraging an edge-cloud collaborative mechanism, HyFedRAG enables RAG to operate across diverse data sources while preserving data privacy. Our key contributions are: (1) We design an edge-cloud collaborative RAG framework built on Flower, which supports querying structured SQL data, semi-structured knowledge graphs, and unstructured documents. The edge-side LLMs convert diverse data into standardized privacy-preserving representations, and the server-side LLM integrates them for global reasoning and generation. (2) We integrate lightweight local retrievers with privacy-aware LLMs and provide three anonymization tools that enable each client to produce semantically rich, de-identified summaries for global inference across devices. (3) To optimize response latency and reduce redundant computation, we design a three-tier caching strategy consisting of a local cache, an intermediate representation cache, and a cloud inference cache. Experimental results on PMC-Patients demonstrate that HyFedRAG outperforms existing baselines in terms of retrieval quality, generation consistency, and system efficiency. Our framework offers a scalable and privacy-compliant solution for RAG over structurally heterogeneous data, unlocking the potential of LLMs in sensitive and diverse data environments.
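
The three-tier caching strategy can be sketched as a layered lookup. The class below follows the tier names from the abstract, but the keys and the eviction-free policy are simplifying assumptions, not HyFedRAG's implementation.

```python
# Minimal sketch of a three-tier cache: local answer cache, intermediate
# (de-identified summary) cache, and cloud inference cache.
class ThreeTierCache:
    def __init__(self):
        self.local = {}      # edge: query -> final answer
        self.interm = {}     # edge: query -> de-identified summary
        self.cloud = {}      # server: summary -> generated response

    def answer(self, query, summarize, generate):
        """summarize(query) -> privacy-preserving summary (edge-side LLM);
        generate(summary) -> response (server-side LLM)."""
        if query in self.local:                 # tier 1: full hit, no compute
            return self.local[query]
        summary = self.interm.get(query)
        if summary is None:                     # tier 2: recompute summary
            summary = summarize(query)
            self.interm[query] = summary
        response = self.cloud.get(summary)
        if response is None:                    # tier 3: cloud inference
            response = generate(summary)
            self.cloud[summary] = response
        self.local[query] = response
        return response
```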


DCMI: A Differential Calibration Membership Inference Attack Against Retrieval-Augmented Generation

Gao, Xinyu, Meng, Xiangtao, Dong, Yingkai, Li, Zheng, Guo, Shanqing

arXiv.org Artificial Intelligence

While Retrieval-Augmented Generation (RAG) effectively reduces hallucinations by integrating external knowledge bases, it introduces vulnerabilities to membership inference attacks (MIAs), particularly in systems handling sensitive data. Existing MIAs targeting RAG's external databases often rely on model responses but ignore the interference of non-member-retrieved documents on RAG outputs, limiting their effectiveness. To address this, we propose DCMI, a differential calibration MIA that mitigates the negative impact of non-member-retrieved documents. Specifically, DCMI leverages the sensitivity gap between member and non-member retrieved documents under query perturbation. It generates perturbed queries for calibration to isolate the contribution of member-retrieved documents while minimizing the interference from non-member-retrieved documents. Experiments under progressively relaxed assumptions show that DCMI consistently outperforms baselines; for example, it achieves 97.42% AUC and 94.35% Accuracy against the RAG system with Flan-T5, exceeding the MBA baseline by over 40%. Furthermore, on real-world RAG platforms such as Dify and MaxKB, DCMI maintains a 10%-20% advantage over the baseline. These results highlight significant privacy risks in RAG systems and emphasize the need for stronger protection mechanisms. We encourage the community to pursue deeper investigations, like ours, into the data leakage risks of rapidly evolving RAG systems.
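
Differential calibration can be summarized as "score on the original query minus the average score on perturbed queries". The sketch below assumes a word-dropout perturbation and a generic `rag_score` callable; the paper's actual perturbations and scoring may differ.

```python
# Hedged sketch of differential-calibration membership scoring: member
# documents should be more sensitive to query perturbation, so subtracting
# the perturbed average isolates their contribution.
import random

def perturb(query: str, n: int = 4):
    """Cheap word-dropout perturbation; DCMI's perturbations may differ."""
    words = query.split()
    out = []
    for _ in range(n):
        kept = [w for w in words if random.random() > 0.15] or words
        out.append(" ".join(kept))
    return out

def dcmi_score(rag_score, query: str) -> float:
    """rag_score(q) -> float: similarity between the candidate document and
    the RAG system's output for query q (an assumed interface)."""
    base = rag_score(query)
    perturbed = [rag_score(q) for q in perturb(query)]
    return base - sum(perturbed) / len(perturbed)  # higher => likely member
```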


Intent Aware Context Retrieval for Multi-Turn Agricultural Question Answering

Vijayvargia, Abhay, Nagpal, Ajay, Pundalik, Kundeshwar, Savarkar, Atharva, Gautam, Smita, Singh, Pankaj, Saluja, Rohit, Ramakrishnan, Ganesh

arXiv.org Artificial Intelligence

Indian farmers often lack timely, accessible, and language-friendly agricultural advice, especially in rural areas with low literacy. To address this gap in accessibility, this paper presents a novel AI-powered agricultural chatbot, Krishi Sathi, designed to support Indian farmers by providing personalized, easy-to-understand answers to their queries through both text and speech. The system's intelligence stems from an instruction-fine-tuned (IFT) model, subsequently refined through fine-tuning on Indian agricultural knowledge across three curated datasets. Unlike traditional chatbots that respond to one-off questions, Krishi Sathi follows a structured, multi-turn conversation flow to gradually collect the necessary details from the farmer, ensuring the query is fully understood before generating a response. Once the intent and context are extracted, the system performs Retrieval-Augmented Generation (RAG) by first fetching information from a curated agricultural database and then generating a tailored response using the IFT model. The chatbot supports both English and Hindi, with speech input and output features (via ASR and TTS) to make it accessible for users with low literacy or limited digital skills. This work demonstrates how combining intent-driven dialogue flows, instruction-tuned models, and retrieval-based generation can improve the quality and accessibility of digital agricultural support in India. This approach yielded strong results, with the system achieving a query response accuracy of 97.53%, 91.35% contextual relevance and personalization, and a query completion rate of 97.53%. The average response time remained under 6 seconds, ensuring timely support for users across both English and Hindi interactions.
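
The structured multi-turn flow amounts to slot filling followed by RAG. Here is a hypothetical sketch; the slot names and the `extract`/`retrieve`/`generate` callables are placeholders, not Krishi Sathi's implementation.

```python
# Illustrative intent-aware multi-turn flow: keep asking until the required
# details are collected, then run retrieval-augmented generation.
REQUIRED_SLOTS = ["crop", "location", "growth_stage"]   # assumed slot set

def chat_turn(state: dict, user_msg: str, extract, retrieve, generate) -> str:
    """extract(msg) -> dict of slot values found in the message;
    retrieve(state) -> context passages; generate(state, ctx) -> answer."""
    state.update({k: v for k, v in extract(user_msg).items() if v})
    missing = [s for s in REQUIRED_SLOTS if s not in state]
    if missing:                        # query not yet fully understood
        return f"Could you tell me your {missing[0].replace('_', ' ')}?"
    context = retrieve(state)          # curated agricultural database lookup
    return generate(state, context)    # IFT model produces the final answer
```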


TypeTele: Releasing Dexterity in Teleoperation by Dexterous Manipulation Types

Lin, Yuhao, Wei, Yi-Lin, Liao, Haoran, Lin, Mu, Xing, Chengyi, Li, Hao, Zhang, Dandan, Cutkosky, Mark, Zheng, Wei-Shi

arXiv.org Artificial Intelligence

Dexterous teleoperation plays a crucial role in robotic manipulation for real-world data collection and remote robot control. Previous dexterous teleoperation mostly relies on hand retargeting to closely mimic human hand postures. However, these approaches may fail to fully leverage the inherent dexterity of dexterous hands, which can execute unique actions through their structural advantages compared to human hands. To address this limitation, we propose TypeTele, a type-guided dexterous teleoperation system, which enables dexterous hands to perform actions that are not constrained by human motion patterns. This is achieved by introducing dexterous manipulation types into the teleoperation system, allowing operators to employ appropriate types to complete specific tasks. To support this system, we build an extensible dexterous manipulation type library to cover the comprehensive dexterous postures used in manipulation tasks. During teleoperation, we employ an MLLM (multimodal large language model)-assisted type retrieval module to identify the most suitable manipulation type based on the specific task and operator commands. Extensive experiments on real-world teleoperation and imitation learning demonstrate that incorporating manipulation types takes full advantage of the dexterous robot's capabilities, enabling it to perform diverse and complex tasks with higher success rates.
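
Type retrieval can be approximated as nearest-neighbor search over a type library. The toy sketch below uses a hash-based stand-in embedding instead of an MLLM, and the library entries are invented examples, not TypeTele's actual library.

```python
# Toy sketch of retrieving a manipulation type from a type library given a
# task description; a real system would query an MLLM for this decision.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Hash-seeded stand-in embedding (stable within one process); a real
    system would use a learned text or vision-language encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

TYPE_LIBRARY = {                 # invented example entries
    "pinch": "precision grasp of small thin objects with two fingertips",
    "power_grasp": "wrap all fingers around a large heavy object",
    "hook": "curl fingers to hook and pull a handle",
}

def retrieve_type(task: str) -> str:
    q = embed(task)
    return max(TYPE_LIBRARY, key=lambda t: float(q @ embed(TYPE_LIBRARY[t])))

print(retrieve_type("pick up a screw from the table"))
```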