Information Fusion
BayReL: Bayesian Relational Learning for Multi-omics Data Integration
High-throughput molecular profiling technologies have produced high-dimensional multi-omics data, enabling systematic understanding of living systems at the genome scale. Studying molecular interactions across different data types helps reveal signal transduction mechanisms across different classes of molecules. In this paper, we develop a novel Bayesian representation learning method that infers the relational interactions across multi-omics data types. Our method, Bayesian Relational Learning (BayReL) for multi-omics data integration, takes advantage of a priori known relationships among the same class of molecules, modeled as a graph at each corresponding view, to learn view-specific latent variables as well as a multi-partite graph that encodes the interactions across views. Our experiments on several real-world datasets demonstrate enhanced performance of BayReL in inferring meaningful interactions compared to existing baselines.
Kalman Filter, Sensor Fusion, and Constrained Regression: Equivalences and Insights
The Kalman filter (KF) is one of the most widely used tools for data assimilation and sequential estimation. In this work, we show that the state estimates from the KF in a standard linear dynamical system setting are equivalent to those given by the KF in a transformed system, with infinite process noise (i.e., a "flat prior") and an augmented measurement space. This reformulation, which we refer to as augmented measurement sensor fusion (SF), is conceptually interesting because the transformed system is seemingly static (there is effectively no process model), yet it still captures the state dynamics inherent to the KF by folding the process model into the measurement space. Further, this reformulation of the KF turns out to be useful in settings in which past states are observed eventually (at some lag). Here, when the measurement noise covariance is estimated by the empirical covariance, we show that the state predictions from SF are equivalent to those from a regression of past states on past measurements, subject to particular linear constraints (reflecting the relationships encoded in the measurement map). This allows us to port standard ideas (say, regularization methods) in regression over to dynamical systems.
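The flavor of this equivalence can be checked numerically. The following is a minimal sketch, not necessarily the paper's exact construction: it assumes one common way of folding the process model into the measurement space, namely treating the one-step prediction as an extra pseudo-measurement of the state and then solving a weighted least-squares problem with no prior. The function names, the toy constant-velocity system, and all numeric values are illustrative assumptions.

```python
import numpy as np


def kf_step(x_prev, P_prev, z, F, H, Q, R):
    """One predict + update step of the standard Kalman filter."""
    x_pred = F @ x_prev
    P_pred = F @ P_prev @ F.T + Q
    S = H @ P_pred @ H.T + R                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x_prev)) - K @ H) @ P_pred
    return x_new, P_new


def augmented_step(x_prev, P_prev, z, F, H, Q, R):
    """Same step as a static, flat-prior estimate with an augmented measurement.

    Assumption of this sketch: the prediction F @ x_prev is stacked under z as
    a pseudo-measurement of the state, with noise covariance P_pred.
    """
    x_pred = F @ x_prev
    P_pred = F @ P_prev @ F.T + Q
    k, d = H.shape
    z_aug = np.concatenate([z, x_pred])
    H_aug = np.vstack([H, np.eye(d)])
    R_aug = np.block([[R, np.zeros((k, d))],
                      [np.zeros((d, k)), P_pred]])
    W = np.linalg.inv(R_aug)
    # Generalized least squares with no prior term on the state.
    P_new = np.linalg.inv(H_aug.T @ W @ H_aug)
    x_new = P_new @ H_aug.T @ W @ z_aug
    return x_new, P_new


# Toy constant-velocity system (illustrative values only).
rng = np.random.default_rng(0)
F = np.array([[1.0, 1.0], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[0.5]])

x_kf, P_kf = np.zeros(2), np.eye(2)
x_sf, P_sf = np.zeros(2), np.eye(2)
for _ in range(5):
    z = rng.normal(size=1)
    x_kf, P_kf = kf_step(x_kf, P_kf, z, F, H, Q, R)
    x_sf, P_sf = augmented_step(x_sf, P_sf, z, F, H, Q, R)

print(np.allclose(x_kf, x_sf), np.allclose(P_kf, P_sf))
```

The two recursions should agree up to floating-point error, since the gain form and the information (least-squares) form of the update are related by the matrix inversion lemma.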
On Single Source Robustness in Deep Fusion Models
Algorithms that fuse multiple input sources benefit from both complementary and shared information. Shared information may provide robustness against faulty or noisy inputs, which is indispensable for safety-critical applications like self-driving cars. We investigate learning fusion algorithms that are robust against noise added to a single source. We first demonstrate that robustness against single source noise is not guaranteed in a linear fusion model. Motivated by this discovery, two possible approaches are proposed to increase robustness: a carefully designed loss with corresponding training algorithms for deep fusion models, and a simple convolutional fusion layer that has a structural advantage in dealing with noise.
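Why a linear fusion model offers no robustness guarantee can be illustrated with a toy case. The sketch below is an illustration under simplifying assumptions, not the paper's construction: two perfectly redundant clean sources, a linear fusion y = w1*x1 + w2*x2, and independent zero-mean noise of variance sigma^2 added to one source at a time. All names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(size=1000)
X = np.stack([signal, signal], axis=1)   # two redundant, noise-free sources
y = signal                               # target equals the shared signal


def clean_mse(w):
    """Mean squared error of the linear fusion on clean inputs."""
    return np.mean((X @ w - y) ** 2)


def worst_single_source_mse(w, sigma=1.0):
    """Expected MSE when the most harmful single source is corrupted.

    Independent zero-mean noise with variance sigma^2 on source i adds
    w[i]^2 * sigma^2 to the expected squared error.
    """
    return clean_mse(w) + sigma ** 2 * np.max(w ** 2)


w_unbalanced = np.array([1.0, 0.0])   # relies entirely on source 1
w_balanced = np.array([0.5, 0.5])     # spreads weight across both sources

for name, w in [("unbalanced", w_unbalanced), ("balanced", w_balanced)]:
    print(name, clean_mse(w), worst_single_source_mse(w))
```

Both weight vectors fit the clean data perfectly, so the training loss alone cannot distinguish them, yet the unbalanced one passes all of the noise through when its only used source is corrupted.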
A Visual Cooperative Localization Method for Airborne Magnetic Surveying Based on a Manifold Sensor Fusion Algorithm Using Lie Groups
Liu, Liang, Hu, Xiao, Jiang, Wei, Meng, Guanglei, Wang, Zhujun, Zhang, Taining
Recent advancements in UAV technology have spurred interest in developing multi-UAV aerial surveying systems for use in confined environments where GNSS signals are blocked or jammed. This paper focuses on airborne magnetic surveying scenarios. To obtain clean magnetic measurements reflecting the Earth's magnetic field, the magnetic sensor must be isolated from other electronic devices, creating a significant localization challenge. We propose a visual cooperative localization solution. The solution incorporates a visual processing module and an improved manifold-based sensor fusion algorithm, delivering reliable and accurate positioning information. Real flight experiments validate the approach, demonstrating single-axis centimeter-level accuracy and decimeter-level overall 3D positioning accuracy.
BayesIMP: Uncertainty Quantification for Causal Data Fusion
While causal models are becoming one of the mainstays of machine learning, the problem of uncertainty quantification in causal inference remains challenging. In this paper, we study the causal data fusion problem, where data arising from multiple causal graphs are combined to estimate the average treatment effect of a target variable. As data arise from multiple sources and can vary in quality and sample size, principled uncertainty quantification becomes essential. To that end, we introduce Bayesian Causal Mean Processes, a framework that combines ideas from probabilistic integration and kernel mean embeddings to represent interventional distributions in the reproducing kernel Hilbert space, while taking into account the uncertainty within each causal graph. To demonstrate the informativeness of our uncertainty estimation, we apply our method to the Causal Bayesian Optimisation task and show improvements over state-of-the-art methods.
Liver Cancer Knowledge Graph Construction based on dynamic entity replacement and masking strategies RoBERTa-BiLSTM-CRF model
Zhang, YiChi, Wang, HaiLing, Gao, YongBin, Hu, XiaoJun, Fan, YingFang, Fang, ZhiJun
Background: Liver cancer ranks as the fifth most common malignant tumor and the second most fatal in our country. Early diagnosis is crucial, necessitating that physicians identify liver cancer in patients at the earliest possible stage. However, the diagnostic process is complex and demanding. Physicians must analyze a broad spectrum of patient data, encompassing physical condition, symptoms, medical history, and results from various examinations and tests, recorded in both structured and unstructured medical formats. This results in a significant workload for healthcare professionals. In response, integrating knowledge graph technology to develop a liver cancer knowledge graph-assisted diagnosis and treatment system aligns with national efforts toward smart healthcare. Such a system promises to mitigate the challenges faced by physicians in diagnosing and treating liver cancer. Methods: This paper addresses the major challenges in building a knowledge graph for hepatocellular carcinoma diagnosis, such as the discrepancy between public data sources and real electronic medical records, the effective integration of which remains a key issue. The knowledge graph construction process consists of six steps: conceptual layer design, data preprocessing, entity identification, entity normalization, knowledge fusion, and graph visualization. A novel Dynamic Entity Replacement and Masking Strategy (DERM) for named entity recognition is proposed. Results: A knowledge graph for liver cancer was established, including 7 entity types such as disease, symptom, and constitution, containing 1495 entities. The recognition accuracy of the model was 93.23%, the recall was 94.69%, and the F1 score was 93.96%.
Learning Content-Aware Multi-Modal Joint Input Pruning via Bird's-Eye-View Representation
Li, Yuxin, Li, Yiheng, Yang, Xulei, Yu, Mengying, Huang, Zihang, Wu, Xiaojun, Yeo, Chai Kiat
In the landscape of autonomous driving, Bird's-Eye-View (BEV) representation has recently garnered substantial academic attention, serving as a transformative framework for the fusion of multi-modal sensor inputs. This BEV paradigm effectively shifts the sensor fusion challenge from a rule-based methodology to a data-centric approach, thereby facilitating more nuanced feature extraction from an array of heterogeneous sensors. Notwithstanding its evident merits, the computational overhead associated with BEV-based techniques often mandates high-capacity hardware infrastructures, thus posing challenges for practical, real-world implementations. To mitigate this limitation, we introduce a novel content-aware multi-modal joint input pruning technique. Our method leverages BEV as a shared anchor to algorithmically identify and eliminate non-essential sensor regions prior to their introduction into the perception model's backbone. We validate the efficacy of our approach through extensive experiments on the NuScenes dataset, demonstrating substantial computational efficiency without sacrificing perception accuracy. To the best of our knowledge, this work represents the first attempt to alleviate the computational burden from the input pruning point.
STNet: Deep Audio-Visual Fusion Network for Robust Speaker Tracking
Li, Yidi, Liu, Hong, Yang, Bing
Audio-visual speaker tracking aims to determine the location of human targets in a scene using signals captured by a multi-sensor platform, whose accuracy and robustness can be improved by multi-modal fusion methods. Recently, several fusion methods have been proposed to model the correlation in multiple modalities. However, for the speaker tracking problem, the cross-modal interaction between audio and visual signals has not been well exploited. To this end, we present a novel Speaker Tracking Network (STNet) with a deep audio-visual fusion model in this work. We design a visual-guided acoustic measurement method to fuse heterogeneous cues in a unified localization space, which employs visual observations via a camera model to construct the enhanced acoustic map. For feature fusion, a cross-modal attention module is adopted to jointly model multi-modal contexts and interactions. The correlated information between audio and visual features is further interacted in the fusion model. Moreover, the STNet-based tracker is applied to multi-speaker cases by a quality-aware module, which evaluates the reliability of multi-modal observations to achieve robust tracking in complex scenarios. Experiments on the AV16.3 and CAV3D datasets show that the proposed STNet-based tracker outperforms uni-modal methods and state-of-the-art audio-visual speaker trackers.
Speaker tracking is a fundamental task in human-computer interaction that determines the position of the speaker in each time step by analyzing data from sensors such as microphones and cameras [1]. It has wide applications in intelligent surveillance [2], multimedia systems [3], and robot navigation [4]. In general, the basic approaches for solving the tracking problem include computer vision-based face or body tracking methods [5-7] and auditory-based Sound Source Localization (SSL) methods [8, 9]. However, it is difficult for uni-modal methods to adapt to complex dynamic environments. For example, visual trackers are susceptible to object occlusion and changes in illumination and appearance. Acoustic tracking, in turn, is not subject to visual interference, but the intermittent nature of speech signals, background noise, and room reverberation constrain the performance of SSL-based trackers.
This work is supported by National Natural Science Foundation of China (No. 62403345).
Channel-Aware Throughput Maximization for Cooperative Data Fusion in CAV
An, Haonan, Fang, Zhengru, Zhang, Yuang, Hu, Senkang, Chen, Xianhao, Xu, Guowen, Fang, Yuguang
Connected and autonomous vehicles (CAVs) have garnered significant attention due to their extended perception range and enhanced sensing coverage. To address challenges such as blind spots and obstructions, CAVs employ vehicle-to-vehicle (V2V) communications to aggregate sensory data from surrounding vehicles. However, cooperative perception is often constrained by the limitations of achievable network throughput and channel quality. In this paper, we propose a channel-aware throughput maximization approach to facilitate CAV data fusion, leveraging a self-supervised autoencoder for adaptive data compression. We formulate the problem as a mixed integer programming (MIP) model, which we decompose into two sub-problems to derive optimal data rate and compression ratio solutions under given link conditions. An autoencoder is then trained to minimize bitrate with the determined compression ratio, and a fine-tuning strategy is employed to further reduce spectrum resource consumption. Experimental evaluation on the OpenCOOD platform demonstrates the effectiveness of our proposed algorithm, showing more than 20.19% improvement in network throughput and a 9.38% increase in average precision (AP@IoU) compared to state-of-the-art methods, with an optimal latency of 19.99 ms.
Index Terms: Cooperative perception, throughput optimization, connected and autonomous driving (CAV).
Recently, autonomous driving has emerged as a promising technology for smart cities. By leveraging communication and artificial intelligence (AI) technologies, autonomous driving can significantly enhance the performance of a city's transportation system. This improvement is achieved through real-time perception of road conditions and precise object detection from onboard sensors (such as radars, LiDARs, and cameras), thereby improving road safety without human intervention [1]. Moreover, the ability of autonomous vehicles to adapt to dynamic environments and communicate with surrounding infrastructure and vehicles is crucial for maintaining the timeliness and accuracy of collected data, thereby enhancing the overall system performance [2]-[9]. Joint perception among connected and autonomous vehicles (CAVs) is a key enabler to overcome the limitations of individual agent sensing capabilities [10].
LEGO: Learnable Expansion of Graph Operators for Multi-Modal Feature Fusion
Ding, Dexuan, Wang, Lei, Zhu, Liyun, Gedeon, Tom, Koniusz, Piotr
In computer vision tasks, features often come from diverse representations, domains, and modalities, such as text, images, and videos. Effectively fusing these features is essential for robust performance, especially with the availability of powerful pre-trained models like vision-language models. However, common fusion methods, such as concatenation, element-wise operations, and non-linear techniques, often fail to capture structural relationships and deep feature interactions, and they suffer from inefficiency or misalignment of features across domains. In this paper, we shift from high-dimensional feature space to a lower-dimensional, interpretable graph space by constructing similarity graphs that encode feature relationships at different levels, e.g., clip, frame, patch, and token. To capture deeper interactions, we use graph power expansions and introduce a learnable graph fusion operator to combine these graph powers for more effective fusion. Our approach is relationship-centric, operates in a homogeneous space, and is mathematically principled, resembling element-wise similarity score aggregation via multilinear polynomials. We demonstrate the effectiveness of our graph-based fusion method on video anomaly detection, showing strong performance across multi-representational, multi-modal, and multi-domain feature fusion tasks.
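The idea of fusing in graph space can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's operator: cosine similarity is assumed for graph construction, the fusion is assumed to be a plain weighted sum of graph powers, and the weights are fixed here, whereas the abstract describes them as learnable. The function names, feature shapes, and weight values are hypothetical.

```python
import numpy as np


def cosine_graph(feats):
    """Similarity graph over n items from an (n, dim) feature matrix."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return normed @ normed.T


def graph_powers(A, max_power):
    """Return [A, A^2, ..., A^max_power] (matrix powers)."""
    powers = [A]
    for _ in range(max_power - 1):
        powers.append(powers[-1] @ A)
    return powers


def fuse(graphs, weights, max_power):
    """Weighted sum of graph powers across modalities.

    weights has shape (num_graphs, max_power); in a trained model these
    would be learned parameters, here they are fixed for illustration.
    """
    fused = np.zeros_like(graphs[0])
    for g, A in enumerate(graphs):
        for p, A_pow in enumerate(graph_powers(A, max_power)):
            fused += weights[g, p] * A_pow
    return fused


rng = np.random.default_rng(0)
n_items = 6                                   # e.g. six video clips
video_feats = rng.normal(size=(n_items, 16))  # features of different sizes
text_feats = rng.normal(size=(n_items, 8))

graphs = [cosine_graph(video_feats), cosine_graph(text_feats)]
weights = np.full((2, 3), 1.0 / 6.0)          # uniform placeholder weights
fused = fuse(graphs, weights, max_power=3)
print(fused.shape)
```

Because every modality is mapped to an n-by-n graph over the same items, features of different dimensionality end up in a common space where they can be combined directly.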