AITopics

2510.1343

Country:

Europe (1.00)
North America (0.67)
Asia > Middle East > UAE (0.46)

Genre: Overview (1.00)

Industry:

Education (0.94)
Law (0.69)
Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial Intelligence Oct-17-2025

From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models

Zhou, Chenyue, Wang, Mingxuan, Ma, Yanbiao, Wu, Chenxu, Chen, Wanyi, Qian, Zhe, Liu, Xinyu, Zhang, Yiwei, Wang, Junhao, Xu, Hengbo, Luo, Fei, Chen, Xiaohua, Hao, Xiaoshuai, Li, Hehan, Zhang, Andi, Wang, Wenxuan, Zhang, Kaiyan, Jia, Guoli, Li, Lingling, Lu, Zhiwu, Lu, Yang, Guo, Yike

Multimodal Large Language Models (MLLMs) strive to achieve a profound, human-like understanding of and interaction with the physical world, but often exhibit a shallow and incoherent integration when acquiring information (Perception) and conducting reasoning (Cognition). This disconnect leads to a spectrum of reasoning failures, with hallucination being the most prominent. Collectively, these issues expose a fundamental challenge: the ability to process pixels does not yet confer the ability to construct a coherent, credible internal world model. To systematically dissect and address this challenge, this survey introduces a novel and unified analytical framework: "From Perception to Cognition." We deconstruct the complex process of vision-language interactive understanding into two interdependent layers: Perception, the foundational ability to accurately extract visual information and achieve fine-grained alignment with textual instructions; and Cognition, the higher-order capability for proactive, multi-step, goal-oriented reasoning built upon this perceptual foundation, the core of which is the formation of a dynamic observe-think-verify reasoning loop. Guided by this framework, this paper systematically analyzes the key bottlenecks of current MLLMs at both layers. It surveys the landscape of cutting-edge methods designed to address these challenges, spanning from techniques that enhance low-level visual representations to those that improve high-level reasoning paradigms. Furthermore, we review critical benchmarks and delineate future research directions. This survey aims to provide the research community with a clear, structured perspective for understanding the intrinsic limitations of current MLLMs and to illuminate the path toward building next-generation models capable of deep reasoning and a genuine understanding of the world.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

2509.25373

Country: Asia > China (0.28)

Genre:

Overview (1.00)
Research Report > Promising Solution (0.34)

Industry:

Health & Medicine > Therapeutic Area (0.92)
Education (0.67)
Health & Medicine > Diagnostic Medicine > Imaging (0.45)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)

Oh, Nick, Vrakas, Giorgos D., Brooke, Siân J. M., Morinière, Sasha, Duke, Toju

PETLP: A Privacy-by-Design Pipeline for Social Media Data in AI Research

arXiv.org Artificial Intelligence Oct-17-2025

We introduce PETLP (Privacy-by-design Extract, Transform, Load, and Present), a compliance framework that embeds legal safeguards directly into extended ETL pipelines. Central to PETLP is treating Data Protection Impact Assessments as living documents that evolve from preregistration through dissemination. Through systematic Reddit analysis, we demonstrate how extraction rights fundamentally differ between qualifying research organisations (who can invoke DSM Article 3 to override platform restrictions) and commercial entities (bound by terms of service), whilst GDPR obligations apply universally. We demonstrate why true anonymisation remains unachievable for social media data and expose the legal gap between permitted dataset creation and uncertain model distribution. By structuring compliance decisions into practical workflows and simplifying institutional data management plans, PETLP enables researchers to navigate regulatory complexity with confidence, bridging the gap between legal requirements and research practice.

large language model, machine learning, natural language, (23 more...)

2508.09232

Country: Europe (1.00)

Genre:

Research Report > Experimental Study (1.00)
Overview (1.00)
Research Report > New Finding (0.67)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Government > Regional Government > Europe Government (0.47)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
(2 more...)

Personalized Learning Path Planning with Goal-Driven Learner State Modeling

Lim, Joy Jia Yin, He, Ye, Yu, Jifan, Cong, Xin, Zhang-Li, Daniel, Liu, Zhiyuan, Liu, Huiqin, Hou, Lei, Li, Juanzi, Xu, Bin

Personalized Learning Path Planning (PLPP) aims to design adaptive learning paths that align with individual goals. While large language models (LLMs) show potential in personalizing learning experiences, existing approaches often lack mechanisms for goal-aligned planning. We introduce Pxplore, a novel framework for PLPP that integrates a reinforcement-based training paradigm and an LLM-driven educational architecture. We design a structured learner state model and an automated reward function that transforms abstract objectives into computable signals. We train the policy by combining supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), and deploy it within a real-world learning platform. Extensive experiments validate Pxplore's effectiveness in producing coherent, personalized, and goal-driven learning paths. We release our code and dataset to facilitate future research.

large language model, learner, machine learning, (18 more...)

2510.13215

Country: North America > United States (0.28)

Genre:

Research Report (1.00)
Questionnaire & Opinion Survey (1.00)
Overview (1.00)
Instructional Material > Course Syllabus & Notes (0.46)

Industry:

Education > Educational Technology > Educational Software > Computer Based Training (1.00)
Education > Educational Setting > Online (1.00)
Education > Curriculum > Subject-Specific Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Emami, Yousef, Zhou, Hao, Nabavirazani, SeyedSina, Almeida, Luis

LLM-Enabled In-Context Learning for Data Collection Scheduling in UAV-assisted Sensor Networks

Unmanned Aerial Vehicles (UAVs) are increasingly being utilized in various private and commercial applications, e.g., traffic control, parcel delivery, and Search and Rescue (SAR) missions. Machine Learning (ML) methods used in UAV-Assisted Sensor Networks (UASNETs), especially Deep Reinforcement Learning (DRL), face challenges such as complex and lengthy model training, gaps between simulation and reality, and low sampling efficiency, which conflict with the urgency of emergencies such as SAR missions. In this paper, an In-Context Learning (ICL)-Data Collection Scheduling (ICLDC) system is proposed as an alternative to DRL in emergencies. The UAV collects sensory data and transmits it to a Large Language Model (LLM), which creates a task description in natural language. From this description, the UAV receives a data collection schedule that must be executed. A verifier ensures safe UAV operations by evaluating the schedules generated by the LLM and overriding unsafe schedules based on predefined rules. The system continuously adapts by incorporating feedback into the task descriptions and using this for future decisions. This method is tested against jailbreaking attacks, where the task description is manipulated to undermine network performance, highlighting the vulnerability of LLMs to such attacks. The proposed ICLDC significantly reduces cumulative packet loss compared to both the DQN and Maximum Channel Gain baselines. ICLDC presents a promising direction for intelligent scheduling and control in UASNETs.

large language model, machine learning, natural language, (16 more...)

2504.14556

Country:

North America > Canada (0.46)
Europe > Portugal (0.28)
North America > United States (0.28)
Asia > Middle East (0.28)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Telecommunications (1.00)
Information Technology > Security & Privacy (1.00)
Energy (1.00)
Government > Military (0.68)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles > Drones (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision

Han, Xiaofeng, Chen, Shunpeng, Fu, Zenghuang, Feng, Zhe, Fan, Lue, An, Dong, Wang, Changwei, Guo, Li, Meng, Weiliang, Zhang, Xiaopeng, Xu, Rongtao, Xu, Shibiao

Robot vision has greatly benefited from advancements in multimodal fusion techniques and vision-language models (VLMs). We adopt a task-oriented perspective to systematically review the applications and advancements of multimodal fusion methods and VLMs in the field of robot vision. For semantic scene understanding tasks, we categorize fusion approaches into encoder-decoder frameworks, attention-based architectures, and graph neural networks. Meanwhile, we also analyze the architectural characteristics and practical implementations of these fusion strategies in key tasks such as simultaneous localization and mapping (SLAM), 3D object detection, navigation, and manipulation. We compare the evolutionary paths and applicability of VLMs based on large language models (LLMs) with traditional multimodal fusion methods. Additionally, we conduct an in-depth analysis of commonly used datasets, evaluating their applicability and challenges in real-world robotic scenarios. Building on this analysis, we identify key challenges in current research, including cross-modal alignment, efficient fusion, real-time deployment, and domain adaptation. We propose future directions such as self-supervised learning for robust multimodal representations, structured spatial memory and environment modeling to enhance spatial intelligence, and the integration of adversarial robustness and human feedback mechanisms to enable ethically aligned system deployment. Through a comprehensive review, comparative analysis, and forward-looking discussion, we provide a valuable reference for advancing multimodal perception and interaction in robotic vision. A comprehensive list of studies in this survey is available at https://github.com/Xiaofeng-Han-Res/MF-RV.

arxiv preprint arxiv, large language model, machine learning, (16 more...)

doi: 10.1016/j.inffus.2025.103652

2504.02477

Country:

Europe (0.67)
North America > United States > Minnesota (0.27)

Genre:

Overview (1.00)
Research Report > Promising Solution (0.67)
Research Report > New Finding (0.45)

Industry:

Energy (1.00)
Automobiles & Trucks (0.93)
Health & Medicine > Therapeutic Area > Neurology (0.92)
(4 more...)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Information Fusion (1.00)
(3 more...)

The Role of Computing Resources in Publishing Foundation Model Research

Hao, Yuexing, Huang, Yue, Zhang, Haoran, Zhao, Chenyang, Liang, Zhenwen, Liang, Paul Pu, Zhao, Yue, Sun, Lichao, Kalantari, Saleh, Zhang, Xiangliang, Ghassemi, Marzyeh

Artificial Intelligence (AI) and machine learning (ML) models have made stark advances in the past three years, fueled by the development of foundation models (FM) trained on large-scale multimodal data. Following the public release of several successful FMs (OpenAI (2022); Brown et al. (2020); Bommasani et al. (2022)), FMs such as large language models (LLMs) and vision language models (VLMs) have bridged vision, language, and other modalities. In many Computer Science subfields such as Natural Language Processing (NLP) and Computer Vision (CV), FMs have demonstrated strong compositional performance and generalization capabilities (Awais et al. (2025); Gunter et al. (2024)), emerging as widely-used tools (Bommasani et al. (2022)) that provide flexible backbones for innovation in other fields (Moor et al. (2023); Sartor & Thompson (2025); Firoozi et al. (2024)). Conducting FM research requires significant data, computing, and human resources (Cottier et al. (2024); Maslej et al. (2024); Crawford (2024)). A central concern in the field is whether greater access to such resources directly translates into more impactful research outcomes (Acemoglu (2024); Dodge et al. (2019); OpenAI (2018)), such as more research publications or higher citation counts (Sinclair et al. (2023); Anjum et al. (2019)). The answer to this question has important implications for how resources are allocated, which research directions are prioritized, and how equitable participation in FM research can be ensured. However, the cost of research is often difficult to quantify due to a lack of uniform disclosure on resource distribution (Bommasani et al. (2024)). Absent widespread disclosure, funding is perhaps most easily characterized in the concrete cost of purchasing or renting hardware (e.g., computing clusters or chips), though there are also costs for software, cloud storage services, and specialized software platforms.

large language model, machine learning, natural language, (20 more...)

2510.13621

Country: North America > United States > California (0.93)

Genre:

Research Report > Experimental Study (1.00)
Questionnaire & Opinion Survey (1.00)
Overview (1.00)

Industry:

Information Technology (0.94)
Social Sector (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.55)

Semantic Communication Enabled Holographic Video Processing and Transmission

Ying, Jingkai, Qi, Zhiyuan, Feng, Yulong, Qin, Zhijin, Han, Zhu, Tafazolli, Rahim, Eldar, Yonina C.

Holographic video communication is considered a paradigm shift in visual communications, becoming increasingly popular for its ability to offer immersive experiences. This article provides an overview of holographic video communication and outlines the requirements of a holographic video communication system. Particularly, following a brief review of semantic communication, an architecture for a semantic-enabled holographic video communication system is presented. Key technologies, including semantic sampling, joint semantic-channel coding, and semantic-aware transmission, are designed based on the proposed architecture. Two related use cases are presented to demonstrate the performance gain of the proposed methods. Finally, potential research topics are discussed to pave the way for the realization of semantic-enabled holographic video communications. Holographic video is a revolutionary information modality, which provides panoramic video content and an immersive experience based on three-dimensional view and high-resolution holograms [1]. Holographic video communication (HVC) is regarded as the dominant paradigm for future visual-type communications. It is considered the potential method to realize metaverse and enable numerous applications, such as holographic conferencing, education, and entertainment.

artificial intelligence, machine learning, point cloud, (15 more...)

2510.13408

Country:

Asia > China (0.29)
North America > United States (0.29)

Genre:

Research Report (0.82)
Overview (0.74)

Industry: Energy (0.46)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.95)

Vitali, Michael, Pinson, Pierre

Prediction Markets with Intermittent Contributions

Although both data availability and the demand for accurate forecasts are increasing, collaboration between stakeholders is often constrained by data ownership and competitive interests. In contrast to recent proposals within cooperative game-theoretical frameworks, we place ourselves in a more general framework, based on prediction markets. There, independent agents trade forecasts of uncertain future events in exchange for rewards. We introduce and analyse a prediction market that (i) accounts for the historical performance of the agents, (ii) adapts to time-varying conditions, while (iii) permitting agents to enter and exit the market at will. The proposed design employs robust regression models to learn the optimal combination of forecasts whilst handling missing submissions. Moreover, we introduce a pay-off allocation mechanism that considers both in-sample and out-of-sample performance while satisfying several desirable economic properties. Case studies using simulated and real-world data demonstrate the effectiveness and adaptability of the proposed market design.

artificial intelligence, machine learning, prediction market, (17 more...)

2510.13385

Country: North America > United States (0.94)

Genre:

Research Report (0.82)
Overview (0.68)

Industry:

Government > Regional Government > North America Government > United States Government (0.94)
Banking & Finance > Trading > Prediction Market (0.82)
Energy > Renewable > Wind (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.36)

Document Intelligence in the Era of Large Language Models: A Survey

Wang, Weishi, Hu, Hengchang, Zhang, Zhijie, Li, Zhaochen, Shao, Hongxin, Dahlmeier, Daniel

Document AI (DAI) has emerged as a vital application area and has been significantly transformed by the advent of large language models (LLMs). While earlier approaches relied on encoder-decoder architectures, decoder-only LLMs have revolutionized DAI, bringing remarkable advancements in understanding and generation. This survey provides a comprehensive overview of DAI's evolution, highlighting current research attempts and future prospects of LLMs in this field. We explore key advancements and challenges in multimodal, multilingual, and retrieval-augmented DAI, while also suggesting future research directions, including agent-based approaches and document-specific foundation models. This paper aims to provide a structured analysis of the state-of-the-art in DAI and its implications for both academic and practical applications.

computational linguistic, large language model, machine learning, (16 more...)

2510.13366

Country:

Asia (1.00)
Europe > France (0.67)
North America > United States > California (0.67)
North America > United States > Florida > Miami-Dade County > Miami (0.14)

Genre:

Research Report (1.00)
Overview (1.00)

Industry: Information Technology (0.45)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)