Overview
LLM Agents Are the Antidote to Walled Gardens
While the Internet's core infrastructure was designed to be open and universal, today's application layer is dominated by closed, proprietary platforms. Open and interoperable APIs require significant investment, and market leaders have little incentive to enable data exchange that could erode their user lock-in. We argue that LLM-based agents fundamentally disrupt this status quo. Agents can automatically translate between data formats and interact with interfaces designed for humans: this makes interoperability dramatically cheaper and effectively unavoidable. We name this shift universal interoperability: the ability for any two digital services to exchange data seamlessly using AI-mediated adapters. Universal interoperability undermines monopolistic behaviours and promotes data portability. However, it can also lead to new security risks and technical debt. Our position is that the ML community should embrace this development while building the appropriate frameworks to mitigate the downsides. By acting now, we can harness AI to restore user freedom and competitive markets without sacrificing security.
Towards Reliable Forgetting: A Survey on Machine Unlearning Verification
Xue, Lulu, Hu, Shengshan, Lu, Wei, Shen, Yan, Li, Dongxu, Guo, Peijin, Zhou, Ziqi, Li, Minghui, Zhang, Yanjun, Zhang, Leo Yu
With growing demands for privacy protection, security, and legal compliance (e.g., GDPR), machine unlearning has emerged as a critical technique for ensuring the controllability and regulatory alignment of machine learning models. However, a fundamental challenge in this field lies in effectively verifying whether unlearning operations have been successfully and thoroughly executed. Despite a growing body of work on unlearning techniques, verification methodologies remain comparatively underexplored and often fragmented. Existing approaches lack a unified taxonomy and a systematic framework for evaluation. To bridge this gap, this paper presents the first structured survey of machine unlearning verification methods. We propose a taxonomy that organizes current techniques into two principal categories -- behavioral verification and parametric verification -- based on the type of evidence used to assess unlearning fidelity. We examine representative methods within each category, analyze their underlying assumptions, strengths, and limitations, and identify potential vulnerabilities in practical deployment. In closing, we articulate a set of open problems in current verification research, aiming to provide a foundation for developing more robust, efficient, and theoretically grounded unlearning verification mechanisms.
GPS-Aided Deep Learning for Beam Prediction and Tracking in UAV mmWave Communication
Nugroho, Vendi Ardianto, Lee, Byung Moo
This work has been published in the IEEE Access with DOI: 10.1109/ACCESS.2025.3586594. Abstract --Millimeter-wave (mmWave) communication enables high data rates for cellular-connected Uncrewed Aerial V ehicles (UA Vs). However, a robust beam management remains challenging due to significant path loss and the dynamic mobility of UA Vs, which can destabilize the UA V-base station (BS) link. This research presents a GPS-aided deep learning (DL) model that simultaneously predicts current and future optimal beams for UA V mmWave communications, maintaining a T op-1 prediction accuracy exceeding 70% and an average power loss below 0.6 dB across all prediction steps. These outcomes stem from a proposed data set splitting method ensuring balanced label distribution, paired with a GPS preprocessing technique that extracts key positional features, and a DL architecture that maps sequential position data to beam index predictions. The model reduces overhead by approximately 93% (requiring the training of 2 3 beams instead of 32 beams) with 95% beam prediction accuracy guarantees, and ensures 94% to 96% of predictions exhibit mean power loss not exceeding 1 dB. Uncrewed Aerial V ehicles (UA V) are expected to serve two roles in wireless networks both as user equipment (UE) that accesses cellular network (cellular-connected UA V) and as UA V -assisted communication platforms providing aerial base stations (BS) and relays for terrestrial users [1] As the high path loss characteristic of mmWave, deploying large antenna arrays on the BS side helps mitigate it by generating narrow beams with strong beamforming gains [4]. As a result, mmWave communications rely heavily on efficient beam management--including beam training and tracking--to quickly select the appropriate beams during intra-and inter-cell mobility, minimizing the risk of beam misalignment [5].
Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement
Ye, Haoran, Jin, Jing, Xie, Yuhang, Zhang, Xin, Song, Guojie
The advancement of large language models (LLMs) has outpaced traditional evaluation methodologies. This progress presents novel challenges, such as measuring human-like psychological constructs, moving beyond static and task-specific benchmarks, and establishing human-centered evaluation. These challenges intersect with psychometrics, the science of quantifying the intangible aspects of human psychology, such as personality, values, and intelligence. This review paper introduces and synthesizes the emerging interdisciplinary field of LLM Psychometrics, which leverages psychometric instruments, theories, and principles to evaluate, understand, and enhance LLMs. The reviewed literature systematically shapes benchmarking principles, broadens evaluation scopes, refines methodologies, validates results, and advances LLM capabilities. Diverse perspectives are integrated to provide a structured framework for researchers across disciplines, enabling a more comprehensive understanding of this nascent field. Ultimately, the review provides actionable insights for developing future evaluation paradigms that align with human-level AI and promote the advancement of human-centered AI systems for societal benefit. A curated repository of LLM psychometric resources is available at https://github.com/valuebyte-ai/Awesome-LLM-Psychometrics.
Survey for Categorising Explainable AI Studies Using Data Analysis Task Frameworks
Ziadeh, Hamzah, Knoche, Hendrik
Research into explainable artificial intelligence (XAI) for data analysis tasks suffer from a large number of contradictions and lack of concrete design recommendations stemming from gaps in understanding the tasks that require AI assistance. In this paper, we drew on multiple fields such as visual analytics, cognition, and dashboard design to propose a method for categorising and comparing XAI studies under three dimensions: what, why, and who. We identified the main problems as: inadequate descriptions of tasks, context-free studies, and insufficient testing with target users. We propose that studies should specifically report on their users' domain, AI, and data analysis expertise to illustrate the generalisability of their findings. We also propose study guidelines for designing and reporting XAI tasks to improve the XAI community's ability to parse the rapidly growing field. We hope that our contribution can help researchers and designers better identify which studies are most relevant to their work, what gaps exist in the research, and how to handle contradictory results regarding XAI design.
BlueGlass: A Framework for Composite AI Safety
Nandigramwar, Harshal, Qutub, Syed, Scholl, Kay-Ulrich
As AI systems become increasingly capable and ubiquitous, ensuring the safety of these systems is critical. However, existing safety tools often target different aspects of model safety and cannot provide full assurance in isolation, highlighting a need for integrated and composite methodologies. This paper introduces BlueGlass, a framework designed to facilitate composite AI safety workflows by providing a unified infrastructure enabling the integration and composition of diverse safety tools that operate across model internals and outputs. Furthermore, to demonstrate the utility of this framework, we present three safety-oriented analyses on vision-language models for the task of object detection: (1) distributional evaluation, revealing performance trade-offs and potential failure modes across distributions; (2) probe-based analysis of layer dynamics highlighting shared hierarchical learning via phase transition; and (3) sparse autoencoders identifying interpretable concepts. More broadly, this work contributes foundational infrastructure and findings for building more robust and reliable AI systems.
Automating SPARQL Query Translations between DBpedia and Wikidata
Bartels, Malte Christian, Banerjee, Debayan, Usbeck, Ricardo
This paper investigates whether state-of-the-art Large Language Models (LLMs) can automatically translate SPARQL between popular Knowledge Graph (KG) schemas. We focus on translations between the DBpedia and Wikidata KG, and later on DBLP and OpenAlex KG. This study addresses a notable gap in KG interoperability research by rigorously evaluating LLM performance on SPARQL-to-SPARQL translation. Two benchmarks are assembled, where the first align 100 DBpedia-Wikidata queries from QALD-9-Plus; the second contains 100 DBLP queries aligned to OpenAlex, testing generalizability beyond encyclopaedic KGs. Three open LLMs: Llama-3-8B, DeepSeek-R1-Distill-Llama-70B, and Mistral-Large-Instruct-2407 are selected based on their sizes and architectures and tested with zero-shot, few-shot, and two chain-of-thought variants. Outputs were compared with gold answers, and resulting errors were categorized. We find that the performance varies markedly across models and prompting strategies, and that translations for Wikidata to DBpedia work far better than translations for DBpedia to Wikidata.
Demonstrating the Octopi-1.5 Visual-Tactile-Language Model
Yu, Samson, Lin, Kelvin, Soh, Harold
Touch is recognized as a vital sense for humans and an equally important modality for robots, especially for dexterous manipulation, material identification, and scenarios involving visual occlusion. Building upon very recent work in touch foundation models, this demonstration will feature Octopi-1.5, our latest visual-tactile-language model. Compared to its predecessor, Octopi-1.5 introduces the ability to process tactile signals from multiple object parts and employs a simple retrieval-augmented generation (RAG) module to improve performance on tasks and potentially learn new objects on-the-fly. The system can be experienced live through a new handheld tactile-enabled interface, the TMI, equipped with GelSight and TAC-02 tactile sensors. This convenient and accessible setup allows users to interact with Octopi-1.5 without requiring a robot. During the demonstration, we will showcase Octopi-1.5 solving tactile inference tasks by leveraging tactile inputs and commonsense knowledge. For example, in a Guessing Game, Octopi-1.5 will identify objects being grasped and respond to follow-up queries about how to handle it (e.g., recommending careful handling for soft fruits). We also plan to demonstrate Octopi-1.5's RAG capabilities by teaching it new items. With live interactions, this demonstration aims to highlight both the progress and limitations of VTLMs such as Octopi-1.5 and to foster further interest in this exciting field. Code for Octopi-1.5 and design files for the TMI gripper are available at https://github.com/clear-nus/octopi-1.5.
A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends
Ding, Yihao, Luo, Siwen, Dai, Yue, Jiang, Yanbei, Li, Zechuan, Martin, Geoffrey, Peng, Yifan
Visually-Rich Document Understanding (VRDU) has emerged as a critical field, driven by the need to automatically process documents containing complex visual, textual, and layout information. Recently, Multimodal Large Language Models (MLLMs) have shown remarkable potential in this domain, leveraging both Optical Character Recognition (OCR)-dependent and OCR-free frameworks to extract and interpret information in document images. This survey reviews recent advancements in MLLM-based VRDU, highlighting three core components: (1) methods for encoding and fusing textual, visual, and layout features; (2) training paradigms, including pretraining strategies, instruction-response tuning, and the trainability of different model modules; and (3) datasets utilized for pretraining, instruction-tuning, and supervised fine-tuning. Finally, we discuss the challenges and opportunities in this evolving field and propose future directions to advance the efficiency, generalizability, and robustness of VRDU systems.
Toward accurate RUL and SOH estimation using reinforced graph-based PINNs enhanced with dynamic weights
Pour, Mohamadreza Akbari, Ghasemzadeh, Ali, Bijarchi, MohamadAli, Shafii, Mohammad Behshad
Accurate estimation of Remaining Useful Life (RUL) and State of Health (SOH) is essential for Prognostics and Health Management (PHM) across a wide range of industrial applications. We propose a novel framework -- Reinforced Graph-Based Physics-Informed Neural Networks Enhanced with Dynamic Weights (RGPD) -- that combines physics-based supervision with advanced spatio-temporal learning. Graph Convolutional Recurrent Networks (GCRNs) embed graph-convolutional filters within recurrent units to capture how node representations evolve over time. Graph Attention Convolution (GATConv) leverages a self-attention mechanism to compute learnable, edge-wise attention coefficients, dynamically weighting neighbor contributions for adaptive spatial aggregation. A Soft Actor-Critic (SAC) module is positioned between the Temporal Attention Unit (TAU) and GCRN to further improve the spatio-temporal learning. This module improves attention and prediction accuracy by dynamically scaling hidden representations to minimize noise and highlight informative features. To identify the most relevant physical constraints in each area, Q-learning agents dynamically assign weights to physics-informed loss terms, improving generalization across real-time industrial systems and reducing the need for manual tuning. In both RUL and SOH estimation tasks, the proposed method consistently outperforms state-of-the-art models, demonstrating strong robustness and predictive accuracy across varied degradation patterns across three diverse industrial benchmark datasets.