Overview
Evaluating Arabic Large Language Models: A Survey of Benchmarks, Methods, and Gaps
Alzubaidi, Ahmed, Alsuwaidi, Shaikha, Boussaha, Basma El Amel, AlQadi, Leen, Alkaabi, Omar, Alyafeai, Mohammed, Alobeidli, Hamza, Hacid, Hakim
This survey provides the first systematic review of Arabic LLM benchmarks, analyzing 40+ evaluation benchmarks across NLP tasks, knowledge domains, cultural understanding, and specialized capabilities. We propose a taxonomy organizing benchmarks into four categories: Knowledge, NLP Tasks, Culture and Dialects, and Target-Specific evaluations. Our analysis reveals significant progress in benchmark diversity while identifying critical gaps: limited temporal evaluation, insufficient multi-turn dialogue assessment, and cultural misalignment in translated datasets. We examine three primary approaches (native collection, translation, and synthetic generation), discussing their trade-offs regarding authenticity, scale, and cost. This work serves as a comprehensive reference for Arabic NLP researchers, providing insights into benchmark methodologies, reproducibility standards, and evaluation metrics while offering recommendations for future development.
From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models
Zhou, Chenyue, Wang, Mingxuan, Ma, Yanbiao, Wu, Chenxu, Chen, Wanyi, Qian, Zhe, Liu, Xinyu, Zhang, Yiwei, Wang, Junhao, Xu, Hengbo, Luo, Fei, Chen, Xiaohua, Hao, Xiaoshuai, Li, Hehan, Zhang, Andi, Wang, Wenxuan, Zhang, Kaiyan, Jia, Guoli, Li, Lingling, Lu, Zhiwu, Lu, Yang, Guo, Yike
Multimodal Large Language Models (MLLMs) strive to achieve a profound, human-like understanding of and interaction with the physical world, but often exhibit a shallow and incoherent integration when acquiring information (Perception) and conducting reasoning (Cognition). This disconnect leads to a spectrum of reasoning failures, with hallucination being the most prominent. Collectively, these issues expose a fundamental challenge: the ability to process pixels does not yet confer the ability to construct a coherent, credible internal world model. To systematically dissect and address this challenge, this survey introduces a novel and unified analytical framework: "From Perception to Cognition." We deconstruct the complex process of vision-language interactive understanding into two interdependent layers: Perception, the foundational ability to accurately extract visual information and achieve fine-grained alignment with textual instructions; and Cognition, the higher-order capability for proactive, multi-step, goal-oriented reasoning built upon this perceptual foundation, the core of which is the formation of a dynamic observe-think-verify reasoning loop. Guided by this framework, this paper systematically analyzes the key bottlenecks of current MLLMs at both layers. It surveys the landscape of cutting-edge methods designed to address these challenges, spanning from techniques that enhance low-level visual representations to those that improve high-level reasoning paradigms. Furthermore, we review critical benchmarks and delineate future research directions. This survey aims to provide the research community with a clear, structured perspective for understanding the intrinsic limitations of current MLLMs and to illuminate the path toward building next-generation models capable of deep reasoning and a genuine understanding of the world.
PETLP: A Privacy-by-Design Pipeline for Social Media Data in AI Research
Oh, Nick, Vrakas, Giorgos D., Brooke, Siân J. M., Morinière, Sasha, Duke, Toju
We introduce PETLP (Privacy-by-design Extract, Transform, Load, and Present), a compliance framework that embeds legal safeguards directly into extended ETL pipelines. Central to PETLP is treating Data Protection Impact Assessments as living documents that evolve from preregistration through dissemination. Through systematic Reddit analysis, we demonstrate how extraction rights fundamentally differ between qualifying research organisations (who can invoke DSM Article 3 to override platform restrictions) and commercial entities (bound by terms of service), whilst GDPR obligations apply universally. We demonstrate why true anonymisation remains unachievable for social media data and expose the legal gap between permitted dataset creation and uncertain model distribution. By structuring compliance decisions into practical workflows and simplifying institutional data management plans, PETLP enables researchers to navigate regulatory complexity with confidence, bridging the gap between legal requirements and research practice.
Personalized Learning Path Planning with Goal-Driven Learner State Modeling
Lim, Joy Jia Yin, He, Ye, Yu, Jifan, Cong, Xin, Zhang-Li, Daniel, Liu, Zhiyuan, Liu, Huiqin, Hou, Lei, Li, Juanzi, Xu, Bin
Personalized Learning Path Planning (PLPP) aims to design adaptive learning paths that align with individual goals. While large language models (LLMs) show potential in personalizing learning experiences, existing approaches often lack mechanisms for goal-aligned planning. We introduce Pxplore, a novel framework for PLPP that integrates a reinforcement-based training paradigm and an LLM-driven educational architecture. We design a structured learner state model and an automated reward function that transforms abstract objectives into computable signals. We train the policy combining supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), and deploy it within a real-world learning platform. Extensive experiments validate Pxplore's effectiveness in producing coherent, personalized, and goal-driven learning paths. We release our code and dataset to facilitate future research.
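The abstract does not say how Pxplore configures GRPO or what its reward function computes. As a minimal sketch of the group-relative advantage commonly associated with GRPO, the snippet below scores each sampled learning path against the mean and standard deviation of its own group, so no separate value network is needed. The function name, the epsilon value, and the example rewards are assumptions made for illustration, not details from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar per candidate path sampled for the same
    learner state (hypothetical values). `eps` guards against a zero
    standard deviation when all candidates score the same.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four candidate paths sampled for one learner.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Candidates that beat the group average receive a positive advantage and the rest a negative one, which is the signal the policy update then uses.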
LLM-Enabled In-Context Learning for Data Collection Scheduling in UAV-assisted Sensor Networks
Emami, Yousef, Zhou, Hao, Nabavirazani, SeyedSina, Almeida, Luis
Unmanned Aerial Vehicles (UAVs) are increasingly being utilized in various private and commercial applications, e.g., traffic control, parcel delivery, and Search and Rescue (SAR) missions. Machine Learning (ML) methods used in UAV-Assisted Sensor Networks (UASNETs), especially Deep Reinforcement Learning (DRL), face challenges such as complex and lengthy model training, gaps between simulation and reality, and low sampling efficiency, which conflict with the urgency of emergencies, such as SAR missions. In this paper, an In-Context Learning (ICL)-Data Collection Scheduling (ICLDC) system is proposed as an alternative to DRL in emergencies. The UAV collects sensory data and transmits it to a Large Language Model (LLM), which creates a task description in natural language. From this description, the UAV receives a data collection schedule that must be executed. A verifier ensures safe UAV operations by evaluating the schedules generated by the LLM and overriding unsafe schedules based on predefined rules. The system continuously adapts by incorporating feedback into the task descriptions and using this for future decisions. This method is tested against jailbreaking attacks, where the task description is manipulated to undermine network performance, highlighting the vulnerability of LLMs to such attacks. The proposed ICLDC significantly reduces cumulative packet loss compared to both the DQN and Maximum Channel Gain baselines. ICLDC presents a promising direction for intelligent scheduling and control in UASNETs.
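The abstract mentions a rule-based verifier but does not list its rules. The following is a hedged sketch of how such an override could look, assuming two hypothetical rules (a minimum battery level and a set of reachable sensors). The class, field names, threshold, and fallback choice are all invented for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UavState:
    """Hypothetical snapshot of the UAV used by the safety rules."""
    battery_fraction: float        # remaining battery in [0, 1]
    reachable_sensors: frozenset   # sensor ids the UAV can safely reach


def verify_schedule(proposed_sensor, state, fallback_sensor, min_battery=0.2):
    """Return the sensor to serve next, overriding an unsafe proposal.

    The two rules below are assumptions: reject any proposal when the
    battery is low, and reject sensors outside the reachable set.
    """
    if state.battery_fraction < min_battery:
        return fallback_sensor
    if proposed_sensor not in state.reachable_sensors:
        return fallback_sensor
    return proposed_sensor


state = UavState(battery_fraction=0.6, reachable_sensors=frozenset({1, 2, 3}))
chosen = verify_schedule(proposed_sensor=7, state=state, fallback_sensor=1)
```

Keeping the verifier outside the LLM means a manipulated task description can at worst trigger the fallback, never an action that the rules forbid.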
Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision
Han, Xiaofeng, Chen, Shunpeng, Fu, Zenghuang, Feng, Zhe, Fan, Lue, An, Dong, Wang, Changwei, Guo, Li, Meng, Weiliang, Zhang, Xiaopeng, Xu, Rongtao, Xu, Shibiao
Robot vision has greatly benefited from advancements in multimodal fusion techniques and vision-language models (VLMs). We adopt a task-oriented perspective to systematically review the applications and advancements of multimodal fusion methods and VLMs in the field of robot vision. For semantic scene understanding tasks, we categorize fusion approaches into encoder-decoder frameworks, attention-based architectures, and graph neural networks. Meanwhile, we also analyze the architectural characteristics and practical implementations of these fusion strategies in key tasks such as simultaneous localization and mapping (SLAM), 3D object detection, navigation, and manipulation. We compare the evolutionary paths and applicability of VLMs based on large language models (LLMs) with traditional multimodal fusion methods. Additionally, we conduct an in-depth analysis of commonly used datasets, evaluating their applicability and challenges in real-world robotic scenarios. Building on this analysis, we identify key challenges in current research, including cross-modal alignment, efficient fusion, real-time deployment, and domain adaptation. We propose future directions such as self-supervised learning for robust multimodal representations, structured spatial memory and environment modeling to enhance spatial intelligence, and the integration of adversarial robustness and human feedback mechanisms to enable ethically aligned system deployment. Through a comprehensive review, comparative analysis, and forward-looking discussion, we provide a valuable reference for advancing multimodal perception and interaction in robotic vision. A comprehensive list of studies in this survey is available at https://github.com/Xiaofeng-Han-Res/MF-RV.
The Role of Computing Resources in Publishing Foundation Model Research
Hao, Yuexing, Huang, Yue, Zhang, Haoran, Zhao, Chenyang, Liang, Zhenwen, Liang, Paul Pu, Zhao, Yue, Sun, Lichao, Kalantari, Saleh, Zhang, Xiangliang, Ghassemi, Marzyeh
Artificial Intelligence (AI) and machine learning (ML) models have made stark advances in the past three years, fueled by the development of foundation models (FM) trained on large-scale multimodal data. Following the public release of several successful FMs (OpenAI (2022); Brown et al. (2020); Bommasani et al. (2022)), FMs such as large language models (LLMs) and vision language models (VLMs) have bridged vision, language, and other modalities. In many Computer Science subfields such as Natural Language Processing (NLP) and Computer Vision (CV), FMs have demonstrated strong compositional performance and generalization capabilities (Awais et al. (2025); Gunter et al. (2024)), emerging as widely-used tools (Bommasani et al. (2022)) that provide flexible backbones for innovation in other fields (Moor et al. (2023); Sartor & Thompson (2025); Firoozi et al. (2024)). Conducting FM research requires significant data, computing, and human resources (Cottier et al. (2024); Maslej et al. (2024); Crawford (2024)). A central concern in the field is whether greater access to such resources directly translates into more impactful research outcomes (Acemoglu (2024); Dodge et al. (2019); OpenAI (2018)), such as more research publications, or higher citation counts (Sinclair et al. (2023); Anjum et al. (2019)). The answer to this question has important implications for how resources are allocated, which research directions are prioritized, and how equitable participation in FM research can be ensured. However, the cost of research is often difficult to quantify due to lack of uniform disclosure on resource distribution (Bommasani et al. (2024)). Absent widespread disclosure, funding is perhaps most easily characterized in the concrete cost of purchasing or renting hardware (e.g., computing clusters, or chips), though there are also software, cloud storage services, and specialized software platform costs.
Semantic Communication Enabled Holographic Video Processing and Transmission
Ying, Jingkai, Qi, Zhiyuan, Feng, Yulong, Qin, Zhijin, Han, Zhu, Tafazolli, Rahim, Eldar, Yonina C.
Holographic video communication is considered a paradigm shift in visual communications, becoming increasingly popular for its ability to offer immersive experiences. This article provides an overview of holographic video communication and outlines the requirements of a holographic video communication system. Particularly, following a brief review of semantic communication, an architecture for a semantic-enabled holographic video communication system is presented. Key technologies, including semantic sampling, joint semantic-channel coding, and semantic-aware transmission, are designed based on the proposed architecture. Two related use cases are presented to demonstrate the performance gain of the proposed methods. Finally, potential research topics are discussed to pave the way for the realization of semantic-enabled holographic video communications. Holographic video is a revolutionary information modality, which provides panoramic video content and an immersive experience based on three-dimensional view and high-resolution holograms [1]. Holographic video communication (HVC) is regarded as the dominant paradigm for future visual-type communications. It is considered the potential method to realize metaverse and enable numerous applications, such as holographic conferencing, education, and entertainment.
Prediction Markets with Intermittent Contributions
Vitali, Michael, Pinson, Pierre
Although both data availability and the demand for accurate forecasts are increasing, collaboration between stakeholders is often constrained by data ownership and competitive interests. In contrast to recent proposals within cooperative game-theoretical frameworks, we place ourselves in a more general framework, based on prediction markets. There, independent agents trade forecasts of uncertain future events in exchange for rewards. We introduce and analyse a prediction market that (i) accounts for the historical performance of the agents, (ii) adapts to time-varying conditions, while (iii) permitting agents to enter and exit the market at will. The proposed design employs robust regression models to learn the optimal combination of forecasts whilst handling missing submissions. Moreover, we introduce a pay-off allocation mechanism that considers both in-sample and out-of-sample performance while satisfying several desirable economic properties. Case studies using simulated and real-world data demonstrate the effectiveness and adaptability of the proposed market design.
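The abstract names robust regression as the combination method but gives no formulas. To illustrate only the missing-submission aspect, the sketch below uses a deliberately simpler stand-in: a weighted average whose weights are renormalized over the agents who actually submitted. The function name, agent labels, and weights are hypothetical, and this is not the paper's estimator.

```python
def combine_forecasts(forecasts, weights):
    """Weighted average over the agents that submitted a forecast.

    `forecasts` maps agent id -> forecast value, or None when the agent
    skipped this round. `weights` maps agent id -> non-negative weight
    (assumed to reflect past performance). Weights of absent agents are
    dropped and the remainder renormalized.
    """
    present = {a: f for a, f in forecasts.items() if f is not None}
    if not present:
        raise ValueError("no forecasts submitted this round")
    total = sum(weights[a] for a in present)
    if total <= 0:
        raise ValueError("present agents carry no weight")
    return sum(weights[a] * f for a, f in present.items()) / total


# Agent "b" is absent this round, so only "a" and "c" contribute.
combined = combine_forecasts(
    {"a": 10.0, "b": None, "c": 20.0},
    {"a": 0.5, "b": 0.3, "c": 0.2},
)
```

Renormalizing keeps the combined forecast on the same scale regardless of how many agents participate, which is one simple way to let agents enter and exit freely.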
Document Intelligence in the Era of Large Language Models: A Survey
Wang, Weishi, Hu, Hengchang, Zhang, Zhijie, Li, Zhaochen, Shao, Hongxin, Dahlmeier, Daniel
Document AI (DAI) has emerged as a vital application area and has been significantly transformed by the advent of large language models (LLMs). While earlier approaches relied on encoder-decoder architectures, decoder-only LLMs have revolutionized DAI, bringing remarkable advancements in understanding and generation. This survey provides a comprehensive overview of DAI's evolution, highlighting current research attempts and future prospects of LLMs in this field. We explore key advancements and challenges in multimodal, multilingual, and retrieval-augmented DAI, while also suggesting future research directions, including agent-based approaches and document-specific foundation models. This paper aims to provide a structured analysis of the state-of-the-art in DAI and its implications for both academic and practical applications.