DocVXQA: Context-Aware Visual Explanations for Document Question Answering

arXiv.org Artificial Intelligence

We propose DocVXQA, a novel framework for visually self-explainable document question answering. The framework is designed not only to produce accurate answers to questions but also to learn visual heatmaps that highlight contextually critical regions, thereby offering interpretable justifications for the model's decisions. To integrate explanations into the learning process, we quantitatively formulate explainability principles as explicit learning objectives. Unlike conventional methods that emphasize only the regions pertinent to the answer, our framework delivers explanations that are contextually sufficient while remaining representation-efficient. This fosters user trust while achieving a balance between predictive performance and interpretability in DocVQA applications. Extensive experiments, including human evaluation, provide strong evidence supporting the effectiveness of our method. The code is available at https://github.com/dali92002/DocVXQA.


AIS Data-Driven Maritime Monitoring Based on Transformer: A Comprehensive Review

arXiv.org Artificial Intelligence

With the increasing demands for safety, efficiency, and sustainability in global shipping, Automatic Identification System (AIS) data plays an increasingly important role in maritime monitoring. AIS data contains spatial-temporal variation patterns of vessels that hold significant research value in the marine domain. However, due to its massive scale, the full potential of AIS data has long remained untapped. With its powerful sequence modeling capabilities, particularly its ability to capture long-range dependencies and complex temporal dynamics, the Transformer model has emerged as an effective tool for processing AIS data. Therefore, this paper reviews the research on Transformer-based AIS data-driven maritime monitoring, providing a comprehensive overview of the current applications of Transformer models in the marine field. The focus is on Transformer-based trajectory prediction methods, behavior detection, and prediction techniques. Additionally, this paper collects and organizes publicly available AIS datasets from the reviewed papers, performing data filtering, cleaning, and statistical analysis. The statistical results reveal the operational characteristics of different vessel types, providing data support for further research on maritime monitoring tasks. Finally, we offer valuable suggestions for future research, identifying two promising research directions. Datasets are available at https://github.com/eyesofworld/Maritime-Monitoring.


Drive Fast, Learn Faster: On-Board RL for High Performance Autonomous Racing

arXiv.org Artificial Intelligence

Autonomous racing presents unique challenges due to its non-linear dynamics, the high speeds involved, and the critical need for real-time decision-making under dynamic and unpredictable conditions. Most traditional Reinforcement Learning (RL) approaches rely on extensive simulation-based pre-training, which faces crucial challenges in transferring effectively to real-world environments. This paper introduces a robust on-board RL framework for autonomous racing, designed to eliminate the dependency on simulation-based pre-training, enabling direct real-world adaptation. The proposed system introduces a refined Soft Actor-Critic (SAC) algorithm, leveraging a residual RL structure to enhance classical controllers in real time by integrating multi-step Temporal-Difference (TD) learning, an asynchronous training pipeline, and Heuristic Delayed Reward Adjustment (HDRA) to improve sample efficiency and training stability. The framework is validated through extensive experiments on the F1TENTH racing platform, where the residual RL controller consistently outperforms the baseline controllers and achieves up to an 11.5% reduction in lap times compared to the State-of-the-Art (SotA) with only 20 min of training. Additionally, an End-to-End (E2E) RL controller trained without a baseline controller surpasses the previous best results with sustained on-track learning.
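Two of the ingredients named in this abstract, multi-step TD learning and a residual RL structure on top of a classical controller, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `scale` factor, and the action bounds are hypothetical, and the SAC entropy term, the asynchronous pipeline, and HDRA are omitted entirely.

```python
from typing import List, Sequence


def n_step_td_target(rewards: Sequence[float],
                     bootstrap_value: float,
                     gamma: float = 0.99) -> float:
    """Generic n-step TD target: sum_k gamma^k * r_k + gamma^n * V(s_n).

    `rewards` holds the n rewards collected after the current state and
    `bootstrap_value` is the critic's estimate at the state reached after
    those n steps. (Assumption: plain discounted return, no entropy bonus.)
    """
    target = bootstrap_value
    # Fold the rewards in from the last step back to the first.
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target


def residual_action(base_action: Sequence[float],
                    residual: Sequence[float],
                    scale: float = 0.2,
                    low: float = -1.0,
                    high: float = 1.0) -> List[float]:
    """Add a scaled learned correction to a classical controller's command.

    `scale`, `low`, and `high` are illustrative values, not taken from
    the paper.
    """
    return [min(high, max(low, b + scale * r))
            for b, r in zip(base_action, residual)]


if __name__ == "__main__":
    print(n_step_td_target([1.0, 1.0], bootstrap_value=10.0, gamma=0.5))
    print(residual_action([0.9, 0.0], [1.0, -0.5]))
```

In a residual setup of this kind, the policy only has to learn a bounded correction, so the baseline controller keeps the vehicle in a reasonable operating range while learning is still in progress.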


Autonomous Robotic Pruning in Orchards and Vineyards: a Review

arXiv.org Artificial Intelligence

Manual pruning is labor intensive and represents up to 25% of annual labor costs in fruit production, notably in apple orchards and vineyards where operational challenges and cost constraints limit the adoption of large-scale machinery. In response, a growing body of research is investigating compact, flexible robotic platforms capable of precise pruning in varied terrains, particularly where traditional mechanization falls short. This paper reviews recent advances in autonomous robotic pruning for orchards and vineyards, addressing a critical need in precision agriculture. Our review examines literature published between 2014 and 2024, focusing on innovative contributions across key system components. Special attention is given to recent developments in machine vision, perception, plant skeletonization, and control strategies, areas that have experienced significant influence from advancements in artificial intelligence and machine learning. The analysis situates these technological trends within broader agricultural challenges, including rising labor costs, a decline in the number of young farmers, and the diverse pruning requirements of different fruit species such as apple, grapevine, and cherry trees. By comparing various robotic architectures and methodologies, this survey not only highlights the progress made toward autonomous pruning but also identifies critical open challenges and future research directions. The findings underscore the potential of robotic systems to bridge the gap between manual and mechanized operations, paving the way for more efficient, sustainable, and precise agricultural practices.


Interpretable Event Diagnosis in Water Distribution Networks

arXiv.org Artificial Intelligence

The increasing penetration of information and communication technologies in the design, monitoring, and control of water systems enables the use of algorithms for detecting and identifying unanticipated events (such as leakages or water contamination) using sensor measurements. However, data-driven methodologies do not always give accurate results and are often not trusted by operators, who may prefer to use their engineering judgment and experience to deal with such events. In this work, we propose a framework for interpretable event diagnosis -- an approach that assists the operators in associating the results of algorithmic event diagnosis methodologies with their own intuition and experience. This is achieved by providing contrasting (i.e., counterfactual) explanations of the results provided by fault diagnosis algorithms; their aim is to improve the understanding of the algorithm's inner workings by the operators, thus enabling them to take a more informed decision by combining the results with their personal experiences. Specifically, we propose counterfactual event fingerprints, a representation of the difference between the current event diagnosis and the closest alternative explanation, which can be presented in a graphical way. The proposed methodology is applied and evaluated on a realistic use case using the L-Town benchmark.

Introduction

When an event, such as a leakage, occurs in a Water Distribution Network (WDN), this can affect the dynamics of the system by causing changes in the pressures and flows [1]. These changes can be monitored by flow and pressure sensors installed within WDNs. Typically, a limited number of flow sensors are installed at the entrance of District Metered Areas (DMAs) to monitor the overall water inflow in the area [2], while a larger number of pressure sensors (due to reduced capital and installation costs) are installed at certain locations within the DMA to improve leakage detectability [3].
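The idea of a counterfactual event fingerprint, described in the abstract above as the difference between the current diagnosis and the closest alternative explanation, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the source does not specify the distance measure, the data layout, or the names `counterfactual_fingerprint` and `candidates`, and the paper's actual formulation may differ.

```python
from typing import Dict, List, Sequence, Tuple


def counterfactual_fingerprint(
    observed: Sequence[float],
    candidates: Dict[str, Sequence[float]],
) -> Tuple[str, str, List[float]]:
    """Return (diagnosis, closest alternative, per-sensor fingerprint).

    `observed` is a vector of sensor readings (e.g. pressure residuals) and
    `candidates` maps each hypothetical event to the readings it would be
    expected to produce. Assumption: candidates are ranked by squared
    Euclidean distance to the observation.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidate events to contrast")

    def distance(signature: Sequence[float]) -> float:
        return sum((o - s) ** 2 for o, s in zip(observed, signature))

    ranked = sorted(candidates, key=lambda name: distance(candidates[name]))
    diagnosis, alternative = ranked[0], ranked[1]

    # Per sensor: how much worse the alternative fits than the diagnosis.
    # Large positive entries mark the sensors that separate the two events.
    fingerprint = [
        (o - a) ** 2 - (o - d) ** 2
        for o, d, a in zip(observed, candidates[diagnosis],
                           candidates[alternative])
    ]
    return diagnosis, alternative, fingerprint


if __name__ == "__main__":
    events = {
        "leak_A": [1.0, 0.0, 2.0],
        "leak_B": [1.0, 1.0, 0.0],
        "leak_C": [5.0, 5.0, 5.0],
    }
    print(counterfactual_fingerprint([1.0, 0.0, 2.0], events))
```

A per-sensor vector like this could be drawn over the network map, so that an operator sees which measurements favour the reported event over its nearest competitor.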


Embodied Intelligence: The Key to Unblocking Generalized Artificial Intelligence

arXiv.org Artificial Intelligence

The ultimate goal of artificial intelligence (AI) is to achieve Artificial General Intelligence (AGI). Embodied Artificial Intelligence (EAI), which involves intelligent systems with physical presence and real-time interaction with the environment, has emerged as a key research direction in pursuit of AGI. While advancements in deep learning, reinforcement learning, large-scale language models, and multimodal technologies have significantly contributed to the progress of EAI, most existing reviews focus on specific technologies or applications. A systematic overview, particularly one that explores the direct connection between EAI and AGI, remains scarce. This paper examines EAI as a foundational approach to AGI, systematically analyzing its four core modules: perception, intelligent decision-making, action, and feedback. We provide a detailed discussion of how each module contributes to the six core principles of AGI. Additionally, we discuss future trends, challenges, and research directions in EAI, emphasizing its potential as a cornerstone for AGI development. Our findings suggest that EAI's integration of dynamic learning and real-world interaction is essential for bridging the gap between narrow AI and AGI.


A Short Overview of Multi-Modal Wi-Fi Sensing

arXiv.org Artificial Intelligence

Wi-Fi sensing has emerged as a significant technology in wireless sensing and Integrated Sensing and Communication (ISAC), offering benefits such as low cost, high penetration, and enhanced privacy. Currently, it is widely utilized in various applications, including action recognition, human localization, and crowd counting. However, Wi-Fi sensing also faces challenges, such as low robustness and difficulties in data collection. Recently, there has been an increasing focus on multi-modal Wi-Fi sensing, where other modalities can act as teachers, providing ground truth or robust features for Wi-Fi sensing models to learn from, or can be directly fused with Wi-Fi for enhanced sensing capabilities. Although these methods have demonstrated promising results and substantial value in practical applications, there is a lack of comprehensive surveys reviewing them. To address this gap, this paper reviews the multi-modal Wi-Fi sensing literature from the past 24 months and highlights the current limitations, challenges and future directions in this field.


A Survey on Data-Driven Modeling of Human Drivers' Lane-Changing Decisions

arXiv.org Artificial Intelligence

Lane-changing (LC) behavior, a critical yet complex driving maneuver, significantly influences driving safety and traffic dynamics. Traditional analytical LC decision (LCD) models, while effective in specific environments, often oversimplify behavioral heterogeneity and complex interactions, limiting their capacity to capture real LCD. Data-driven approaches address these gaps by leveraging rich empirical data and machine learning to decode latent decision-making patterns, enabling adaptive LCD modeling in dynamic environments. In light of the rapid development of artificial intelligence and the demand for data-driven models oriented towards connected vehicles and autonomous vehicles, this paper presents a comprehensive survey of data-driven LCD models, with a particular focus on human drivers' LC decision-making. It systematically reviews the modeling framework, covering data sources and preprocessing, model inputs and outputs, objectives, structures, and validation methods. This survey further discusses the opportunities and challenges faced by data-driven LCD models, including driving safety, uncertainty, as well as the integration and improvement of technical frameworks.

Compared to car-following (CF) behavior, LC behavior entails higher collision risks due to its dependency on holistic evaluations of traffic conditions in both the original and target lanes, requiring drivers to navigate multi-criteria decision-making processes. More specifically, safe LC execution necessitates gaps in the target lane to satisfy collision-avoidance criteria. Drivers must continuously monitor the real-time states of surrounding vehicles (e.g., velocity, acceleration) and adjust their LC maneuvers in response to unexpected behavioral changes (e.g., sudden deceleration, lane encroachment). Human drivers' irrational decision-making (e.g., sudden risk-preference shifts) in dynamic environments poses challenges to traditional LC models based on the hypothesis of the rational man.

This work is supported by the National Natural Science Foundation of China (72288101, 72171018, 72242102). D.-F Xie is with the School of Systems Science, Beijing Jiaotong University, Beijing 100044, China (e-mail: dfxie@bjtu.edu.cn). L. Li is with the Department of Automation, BNRist, Tsinghua University, Beijing 100084, China. He is with Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge MA 02139, the United States (e-mail: he.zb@hotmail.com).

This effort will provide critical support for trustworthy traffic simulations, dynamic traffic management, and LC decision-making of autonomous vehicles (AVs).


Minimizing Risk Through Minimizing Model-Data Interaction: A Protocol For Relying on Proxy Tasks When Designing Child Sexual Abuse Imagery Detection Models

arXiv.org Artificial Intelligence

The distribution of child sexual abuse imagery (CSAI) is an ever-growing concern of our modern world; children who suffered from this heinous crime are revictimized, and the growing amount of illegal imagery distributed overwhelms law enforcement agents (LEAs) with the manual labor of categorization. To ease this burden, researchers have explored methods for automating data triage and detection of CSAI, but the sensitive nature of the data imposes restricted access and minimal interaction between real data and learning algorithms, avoiding leaks at all costs. In observing how these restrictions have shaped the literature, we formalize a definition of "Proxy Tasks", i.e., the substitute tasks used for training models for CSAI without making use of CSA data. Under this new terminology, we review current literature and present a protocol for making conscious use of Proxy Tasks together with consistent input from LEAs to design better automation in this field. Finally, we apply this protocol to study -- for the first time -- the task of Few-shot Indoor Scene Classification on CSAI, showing a final model that achieves promising results on a real-world CSAI dataset whilst having no weights actually trained on sensitive data.


Enterprise Architecture as a Dynamic Capability for Scalable and Sustainable Generative AI adoption: Bridging Innovation and Governance in Large Organisations

arXiv.org Artificial Intelligence

Generative Artificial Intelligence is a powerful new technology with the potential to boost innovation and reshape governance in many industries. Nevertheless, organisations face major challenges in scaling GenAI, including technology complexity, governance gaps and resource misalignments. This study explores how Enterprise Architecture Management can meet the complex requirements of GenAI adoption within large enterprises. Based on a systematic literature review and the qualitative analysis of 16 semi-structured interviews with experts, it examines the relationships between EAM, dynamic capabilities and GenAI adoption. The review identified key limitations in existing EA frameworks, particularly their inability to fully address the unique requirements of GenAI. The interviews, analysed using the Gioia methodology, revealed critical enablers and barriers to GenAI adoption across industries. The findings indicate that EAM, when theorised as sensing, seizing and transforming dynamic capabilities, can enhance GenAI adoption by improving strategic alignment, governance frameworks and organisational agility. However, the study also highlights the need to tailor EA frameworks to GenAI-specific challenges, including low data governance maturity and the balance between innovation and compliance. Several conceptual frameworks are proposed to guide EA leaders in aligning GenAI maturity with organisational readiness. The work contributes to academic understanding and industry practice by clarifying the role of EA in bridging innovation and governance in disruptive technology environments.