Information Fusion
A Methodological and Structural Review of Parkinsons Disease Detection Across Diverse Data Modalities
Miah, Abu Saleh Musa, Suzuki, Taro, Shin, Jungpil
Parkinson's Disease (PD) is a progressive neurological disorder that primarily affects motor functions and can lead to mild cognitive impairment (MCI) and dementia in its advanced stages. With approximately 10 million people diagnosed globally (1 to 1.8 per 1,000 individuals, according to reports by the Japan Times and the Parkinson Foundation), early and accurate diagnosis of PD is crucial for improving patient outcomes. While numerous studies have used machine learning (ML) and deep learning (DL) techniques for PD recognition, existing surveys are limited in scope, often focusing on single data modalities and failing to capture the potential of multimodal approaches. To address these gaps, this study presents a comprehensive review of PD recognition systems across diverse data modalities, including Magnetic Resonance Imaging (MRI), gait-based pose analysis, gait sensory data, handwriting analysis, speech test data, Electroencephalography (EEG), and multimodal fusion techniques. Based on over 347 articles from leading scientific databases, this review examines key aspects such as data collection methods, settings, feature representations, and system performance, with a focus on recognition accuracy and robustness. This survey aims to serve as a comprehensive resource for researchers, providing actionable guidance for the development of next-generation PD recognition systems. By leveraging diverse data modalities and cutting-edge machine learning paradigms, this work contributes to advancing the state of PD diagnostics and improving patient care through innovative, multimodal approaches.
Decentralized Fusion of 3D Extended Object Tracking based on a B-Spline Shape Model
Han, Longfei, Kefferpütz, Klaus, Beyerer, Jürgen
Extended Object Tracking (EOT) exploits the high resolution of modern sensors for detailed environmental perception. Combined with decentralized fusion, it contributes to a more scalable and robust perception system. This paper investigates the decentralized fusion of 3D EOT using a B-spline curve based model. The spline curve is used to represent the side-view profile, which is then extruded with a width to form a 3D shape. We use covariance intersection (CI) for the decentralized fusion and discuss the challenge of applying it to EOT. We further evaluate the tracking result of the decentralized fusion with simulated and real datasets of traffic scenarios. We show that the CI-based fusion can significantly improve the tracking performance for sensors with unfavorable perspective.
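As a rough illustration of the covariance intersection (CI) step mentioned in this abstract, the sketch below fuses two estimates of the same 2D state whose cross-correlation is unknown. It uses the standard textbook CI update and is not the paper's formulation: the 2D state, the helper names (`inv2`, `covariance_intersection`, `fuse_min_trace`), the trace criterion, and the grid search over the weight are all assumptions made for this example, and the B-spline shape state is not modeled.

```python
def inv2(m):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def covariance_intersection(x1, P1, x2, P2, w):
    """Fuse two 2D estimates (x1, P1) and (x2, P2) with weight w in [0, 1]."""
    I1, I2 = inv2(P1), inv2(P2)
    # Fused information matrix: convex combination of both information matrices.
    info = [[w * I1[i][j] + (1 - w) * I2[i][j] for j in range(2)]
            for i in range(2)]
    P = inv2(info)
    # Fused information vector, then mapped back to state space.
    y = [sum(w * I1[i][j] * x1[j] + (1 - w) * I2[i][j] * x2[j]
             for j in range(2)) for i in range(2)]
    x = [sum(P[i][j] * y[j] for j in range(2)) for i in range(2)]
    return x, P


def fuse_min_trace(x1, P1, x2, P2, steps=100):
    """Choose w by a grid search that minimizes the trace of the fused covariance."""
    best = None
    for k in range(steps + 1):
        w = k / steps
        x, P = covariance_intersection(x1, P1, x2, P2, w)
        trace = P[0][0] + P[1][1]
        if best is None or trace < best[0]:
            best = (trace, w, x, P)
    return best[1], best[2], best[3]


# Hypothetical example: each sensor is accurate along a different axis.
w, x, P = fuse_min_trace([1.0, 0.0], [[1.0, 0.0], [0.0, 4.0]],
                         [0.0, 1.0], [[4.0, 0.0], [0.0, 1.0]])
print(w, x, P)
```

Setting `w` to 0 or 1 returns one of the inputs unchanged, so the fused covariance chosen by the grid search never has a larger trace than the better input.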
Implementation Analysis of Collaborative Robot Digital Twins in Physics Engines
König, Christian, Petershans, Jan, Herbst, Jan, Rüb, Matthias, Krummacker, Dennis, Mittag, Eric, Schotten, Hans D.
This paper presents a Digital Twin (DT) of a 6G communications system testbed that integrates two robotic manipulators with a high-precision optical infrared tracking system in Unreal Engine 5. Practical details of the setup and implementation insights provide valuable guidance for users aiming to replicate such systems, an endeavor that is crucial to advancing DT applications within the scientific community. Key topics discussed include video streaming, integration within the Robot Operating System 2 (ROS 2), and bidirectional communication. The insights provided are intended to support the development and deployment of DTs in robotics and automation research.
Rethinking Few-Shot Image Fusion: Granular Ball Priors Enable General-Purpose Deep Fusion
Deng, Minjie, Wei, Yan, Zhai, Hao, Wu, An, Ouyang, Yuncan, Peng, Qianyao
In image fusion tasks, the absence of real fused images as priors presents a fundamental challenge. Most deep learning-based fusion methods rely on large-scale paired datasets to extract global weighting features from raw images, thereby generating fused outputs that approximate real fused images. In contrast to previous studies, this paper explores few-shot training of neural networks under the condition of having prior knowledge. We propose a novel fusion framework named GBFF, and a Granular Ball Significant Extraction algorithm specifically designed for the few-shot prior setting. All pixel pairs involved in the fusion process are initially modeled as a Coarse-Grained Granular Ball. At the local level, Fine-Grained Granular Balls are used to slide through the brightness space to extract Non-Salient Pixel Pairs, and perform splitting operations to obtain Salient Pixel Pairs. Pixel-wise weights are then computed to generate a pseudo-supervised image. At the global level, pixel pairs with significant contributions to the fusion process are categorized into the Positive Region, while those whose contributions cannot be accurately determined are assigned to the Boundary Region. The Granular Ball performs modality-aware adaptation based on the proportion of the positive region, thereby adjusting the neural network's loss function and enabling it to complement the information of the boundary region. Extensive experiments demonstrate the effectiveness of both the proposed algorithm and the underlying theory. Compared with state-of-the-art (SOTA) methods, our approach shows strong competitiveness in terms of both fusion time and image expressiveness. Our code is publicly available at:
STFM: A Spatio-Temporal Information Fusion Model Based on Phase Space Reconstruction for Sea Surface Temperature Prediction
Wang, Yin, Gong, Chunlin, Wu, Xiang, Zhang, Hanleran
The sea surface temperature (SST), a key environmental parameter, is crucial to optimizing production planning, making its accurate prediction a vital research topic. However, the inherent nonlinearity of the marine dynamic system presents significant challenges. Current forecasting methods mainly include physics-based numerical simulations and data-driven machine learning approaches. The former, while describing SST evolution through differential equations, suffers from high computational complexity and limited applicability, whereas the latter, despite its computational benefits, requires large datasets and faces interpretability challenges. This study presents a prediction framework based solely on data-driven techniques. Using phase space reconstruction, we construct initial-delay attractor pairs with a mathematical homeomorphism and design a Spatio-Temporal Fusion Mapping (STFM) to uncover their intrinsic connections. Unlike conventional models, our method captures SST dynamics efficiently through phase space reconstruction and achieves high prediction accuracy with minimal training data in comparative tests.
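To make the phase space reconstruction step more concrete, here is a minimal sketch of a standard time-delay embedding, which turns a scalar series into delay vectors. The abstract does not state the embedding used, so the function name `delay_embed`, the parameters `dim` and `tau`, and the synthetic sine series standing in for SST data are all assumptions; the STFM mapping itself is not shown.

```python
import math


def delay_embed(series, dim, tau):
    """Return delay vectors [x(t), x(t + tau), ..., x(t + (dim - 1) * tau)]."""
    if dim < 1 or tau < 1:
        raise ValueError("dim and tau must be positive integers")
    n_vectors = len(series) - (dim - 1) * tau
    if n_vectors <= 0:
        raise ValueError("series is too short for this dim and tau")
    return [[series[t + k * tau] for k in range(dim)]
            for t in range(n_vectors)]


# Synthetic stand-in for an SST record (hypothetical data, not from the paper).
series = [math.sin(0.2 * t) for t in range(50)]
vectors = delay_embed(series, dim=3, tau=5)
print(len(vectors), vectors[0])
```

Each delay vector is one point on the reconstructed attractor. In practice `dim` and `tau` are usually chosen with diagnostics such as false nearest neighbors and mutual information, which are omitted here.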
4D Multimodal Co-attention Fusion Network with Latent Contrastive Alignment for Alzheimer's Diagnosis
Wei, Yuxiang, Zhang, Yanteng, Xiao, Xi, Wang, Tianyang, Wang, Xiao, Calhoun, Vince D.
Multimodal neuroimaging provides complementary structural and functional insights into both human brain organization and disease-related dynamics. Recent studies demonstrate enhanced diagnostic sensitivity for Alzheimer's disease (AD) through synergistic integration of neuroimaging data (e.g., sMRI, fMRI) with behavioral cognitive scores tabular data biomarkers. However, the intrinsic heterogeneity across modalities (e.g., 4D spatiotemporal fMRI dynamics vs. 3D anatomical sMRI structure) presents critical challenges for discriminative feature fusion. To bridge this gap, we propose M2M-AlignNet: a geometry-aware multimodal co-attention network with latent alignment for early AD diagnosis using sMRI and fMRI. At the core of our approach is a multi-patch-to-multi-patch (M2M) contrastive loss function that quantifies and reduces representational discrepancies via geometry-weighted patch correspondence, explicitly aligning fMRI components across brain regions with their sMRI structural substrates without one-to-one constraints. Additionally, we propose a latent-as-query co-attention module to autonomously discover fusion patterns, circumventing modality prioritization biases while minimizing feature redundancy. We conduct extensive experiments to confirm the effectiveness of our method and highlight the correspondence between fMRI and sMRI as AD biomarkers.
Multi-Modal Fusion of In-Situ Video Data and Process Parameters for Online Forecasting of Cookie Drying Readiness
Food drying is essential for food production, extending shelf life, and reducing transportation costs. Accurate real-time forecasting of drying readiness is crucial for minimizing energy consumption, improving productivity, and ensuring product quality. However, this remains challenging due to the dynamic nature of drying, limited data availability, and the lack of effective predictive analytical methods. To address this gap, we propose an end-to-end multi-modal data fusion framework that integrates in-situ video data with process parameters for real-time food drying readiness forecasting. Our approach leverages a new encoder-decoder architecture with modality-specific encoders and a transformer-based decoder to effectively extract features while preserving the unique structure of each modality. We apply our approach to sugar cookie drying, where time-to-ready is predicted at each timestamp. Experimental results demonstrate that our model achieves an average prediction error of only 15 seconds, outperforming state-of-the-art data fusion methods by 65.69% and a video-only model by 11.30%. The proposed model is extensible to various other industrial modality fusion tasks for online decision-making.
Introduction
Drying is a fundamental process in the food industry that plays a critical role in both food production and preservation. By removing moisture, it transforms raw ingredients into their final, consumable forms while enhancing texture, flavor, and structural integrity [1]. However, food drying is a highly time- and energy-intensive process that accounts for 15% of energy consumption in U.S. industrial processes [2]. As a result, advancing drying technologies and improving product quality are key strategies for minimizing waste and enhancing energy efficiency [3].
Enhanced UAV Navigation Systems through Sensor Fusion with Trident Quaternions
Incicco, Sebastian, Giribet, Juan Ignacio, Colombo, Leonardo
Integrated Navigation (IN) techniques have emerged as a promising solution by combining multiple sensor measurements, such as those obtained from Inertial Measurement Units (IMU), Global Navigation Satellite Systems (GNSS), and vision-based sensors. IN approaches offer significant advantages, including robustness, improved accuracy, and the ability to overcome the limitations of individual sensors. Among the various mathematical tools employed in IN, quaternions have garnered considerable attention for estimating a vehicle's attitude (orientation). Quaternions provide an elegant and compact representation of orientation, avoiding the limitations of traditional Euler angles, such as singularities and ambiguity.
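The point about quaternions as an attitude representation can be illustrated with ordinary unit quaternions. The sketch below builds a quaternion from an axis and angle, composes rotations with the Hamilton product, and rotates a vector. It is a generic example, not the trident quaternion formulation named in the title, and the function names and the (w, x, y, z) component order are choices made for this example.

```python
import math


def quat_from_axis_angle(axis, angle):
    """Unit quaternion (w, x, y, z) for a rotation of `angle` radians about `axis`."""
    norm = math.sqrt(sum(a * a for a in axis))
    if norm == 0:
        raise ValueError("axis must be non-zero")
    s = math.sin(angle / 2) / norm
    return (math.cos(angle / 2), axis[0] * s, axis[1] * s, axis[2] * s)


def quat_mul(p, q):
    """Hamilton product p * q."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    )


def quat_conj(q):
    """Conjugate, which equals the inverse for a unit quaternion."""
    w, x, y, z = q
    return (w, -x, -y, -z)


def rotate(q, v):
    """Rotate 3D vector v by unit quaternion q via q * (0, v) * conj(q)."""
    r = quat_mul(quat_mul(q, (0.0, *v)), quat_conj(q))
    return r[1:]


# A 90 degree yaw applied twice composes into a 180 degree yaw.
yaw90 = quat_from_axis_angle((0.0, 0.0, 1.0), math.pi / 2)
yaw180 = quat_mul(yaw90, yaw90)
print(rotate(yaw90, (1.0, 0.0, 0.0)), rotate(yaw180, (1.0, 0.0, 0.0)))
```

Because the attitude is stored as four numbers with a unit-norm constraint, there is no pitch angle at which the representation becomes singular, which is the limitation of Euler angles the abstract refers to.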
Translating Multimodal AI into Real-World Inspection: TEMAI Evaluation Framework and Pathways for Implementation
Li, Zehan, Deng, Jinzhi, Ma, Haibing, Zhang, Chi, Xiao, Dan
Zehan Li 1,3, Jinzhi Deng 1,2, Haibing Ma 1,2, Chi Zhang 1, and Dan Xiao 1
1 Moximize.ai, 2 Shanghai Zhongqiao Vocational And Technical University, 3 China Creative Studies Institute
April 22, 2025
Abstract
This paper introduces the Translational Evaluation of Multimodal AI for Inspection (TEMAI) framework, bridging multimodal AI capabilities with industrial inspection implementation. Adapting translational research principles from healthcare to industrial contexts, TEMAI establishes three core dimensions: Capability (technical feasibility), Adoption (organizational readiness), and Utility (value realization). The framework demonstrates that technical capability alone yields limited value without corresponding adoption mechanisms. TEMAI incorporates specialized metrics including the Value Density Coefficient and structured implementation pathways. Empirical validation through retail and photovoltaic inspection implementations revealed significant differences in value realization patterns despite similar capability reduction rates, confirming the framework's effectiveness across diverse industrial sectors while highlighting the importance of industry-specific adaptation strategies.
Keywords: Multimodal AI, Industrial Inspection, Translational Framework, TEMAI
1 Introduction
Industrial inspection tasks are fundamental to ensuring operational continuity and safety in manufacturing sectors, serving as a cornerstone for preventive maintenance and risk mitigation. These tasks, however, are plagued by systemic inefficiencies, including labor-intensive workflows, hazardous working environments (e.g., high-temperature zones or toxic gas exposure), and heavy reliance on empirical knowledge that is difficult to standardize or transfer across industries [1].
Despite incremental advancements in automation technologies such as drones, AR-assisted devices, and IoT-enabled sensors, the integration of these tools into inspection workflows has yielded limited returns due to fragmented deployment, high implementation costs, and insufficient interoperability between hardware and software systems [2]. For instance, while drones have reduced human exposure to dangerous environments in power grid inspections, their operational scope remains constrained by battery life and data processing bottlenecks [3].
DeepMLF: Multimodal language model with learnable tokens for deep fusion in sentiment analysis
Georgiou, Efthymios, Katsouros, Vassilis, Avrithis, Yannis, Potamianos, Alexandros
While multimodal fusion has been extensively studied in Multimodal Sentiment Analysis (MSA), the role of fusion depth and multimodal capacity allocation remains underexplored. In this work, we position fusion depth, scalability, and dedicated multimodal capacity as primary factors for effective fusion. We introduce DeepMLF, a novel multimodal language model (LM) with learnable tokens tailored toward deep fusion. DeepMLF leverages an audiovisual encoder and a pretrained decoder LM augmented with multimodal information across its layers. We append learnable tokens to the LM that: 1) capture modality interactions in a controlled fashion and 2) preserve independent information flow for each modality. These fusion tokens gather linguistic information via causal self-attention in LM Blocks and integrate with audiovisual information through cross-attention MM Blocks. Serving as dedicated multimodal capacity, this design enables progressive fusion across multiple layers, providing depth in the fusion process. Our training recipe combines modality-specific losses and language modelling loss, with the decoder LM tasked to predict ground truth polarity. Across three MSA benchmarks with varying dataset characteristics, DeepMLF achieves state-of-the-art performance. Our results confirm that deeper fusion leads to better performance, with optimal fusion depths (5-7) exceeding those of existing approaches. Additionally, our analysis on the number of fusion tokens reveals that small token sets (about 20) achieve optimal performance. We examine the importance of representation learning order (fusion curriculum) through audiovisual encoder initialization experiments. Our ablation studies demonstrate the superiority of the proposed fusion design and gating while providing a holistic examination of DeepMLF's scalability to LLMs, and the impact of each training objective and embedding regularization.