Information Fusion
FedDiff: Diffusion Model Driven Federated Learning for Multi-Modal and Multi-Clients
Li, DaiXun, Xie, Weiying, Wang, ZiXuan, Lu, YiBing, Li, Yunsong, Fang, Leyuan
With the rapid development of imaging sensor technology in remote sensing, multi-modal remote sensing data fusion has emerged as a crucial research direction for land cover classification tasks. While diffusion models have made great progress in generative modeling and image classification, existing models primarily focus on single-modality, single-client control; that is, the diffusion process is driven by a single modality on a single computing node. To facilitate the secure fusion of heterogeneous data from clients, it is necessary to enable distributed multi-modal control, such as privately merging the hyperspectral data of organization A and the LiDAR data of organization B on each base station client. In this study, we propose a multi-modal collaborative diffusion federated learning framework called FedDiff. Our framework establishes a dual-branch diffusion model feature extraction setup, in which the data of the two modalities are fed into separate branches of the encoder. Our key insight is that diffusion models driven by different modalities are inherently complementary in terms of potential denoising steps, on which bilateral connections can be built. Considering the challenge of private and efficient communication between multiple clients, we embed the diffusion model into the federated learning communication structure and introduce a lightweight communication module. Qualitative and quantitative experiments validate the superiority of our framework in terms of image quality and conditional consistency.
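The abstract does not describe FedDiff's communication module in detail, so the following is only a minimal sketch of the generic federated averaging (FedAvg) aggregation step that federated learning frameworks commonly build on, not the authors' method. The function name `federated_average` and the flat-list representation of model parameters are assumptions made for illustration.

```python
def federated_average(client_weights, client_sizes):
    """Sample-size-weighted average of client parameter vectors (plain FedAvg).

    client_weights: list of equal-length lists of floats, one per client
                    (hypothetical flat parameter vectors).
    client_sizes:   number of local training samples held by each client.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    n_params = len(client_weights[0])
    if any(len(w) != n_params for w in client_weights):
        raise ValueError("all clients must share the same parameter shape")
    total = sum(client_sizes)
    # Each client contributes in proportion to its share of the total data;
    # only parameters are exchanged, the raw data never leaves the client.
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(n_params)
    ]


global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

In this toy call the second client holds three times as much data as the first, so the aggregated parameters lie closer to its values.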
Identification and Estimation for Nonignorable Missing Data: A Data Fusion Approach
Wang, Zixiao, Ghassami, AmirEmad, Shpitser, Ilya
Missing data is a pervasive and challenging issue in various applications of statistical inference, such as healthcare, economics, and the social sciences. Data are said to be Missing at Random (MAR) when the mechanism of missingness depends only on the observed data. Strategies to deal with MAR have been extensively investigated in the literature (Dempster et al., 1977; Robins et al., 1994; Tsiatis, 2006; Little and Rubin, 2019). In many practical settings, MAR is not a realistic assumption. Instead, missingness often depends on variables that are themselves missing. Such settings are said to exhibit nonignorable missingness, with the resulting data being Missing Not at Random (MNAR) (Fielding et al., 2008; Schafer and Graham, 2002). A classic example of a scenario with MNAR data occurs in longitudinal studies: due to the treatment's toxicity, some patients may become too ill to visit the clinic, so the outcomes of patients whose circumstances are associated with those outcomes are more likely to be lost to follow-up (Ibrahim et al., 2012). Previous MNAR models have typically imposed constraints on the target distribution and its missingness mechanism to ensure that the parameter of interest can be identified. This approach goes back to the work of Heckman (1979), who proposed an outcome-selection model based on parametric modeling of the outcome variable and the missingness pattern. Little (1993) introduced the pattern-mixture model, in which one needs to specify the distribution for each missing data pattern independently.
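The practical difference between MAR and MNAR can be illustrated with a small simulation. This is a toy sketch under assumed settings (a binary covariate, a Gaussian outcome, and hand-picked observation probabilities), not the estimator proposed in the paper. Adjusting for the observed covariate recovers the outcome mean when missingness depends only on that covariate, but the same adjustment stays biased when missingness depends on the outcome itself.

```python
import random
import statistics


def stratified_mean(x, y_obs):
    """Average the per-stratum means of observed outcomes, weighted by P(x)."""
    n = len(x)
    total = 0.0
    for level in sorted(set(x)):
        seen = [yi for xi, yi in zip(x, y_obs) if xi == level and yi is not None]
        weight = sum(1 for xi in x if xi == level) / n
        total += weight * statistics.fmean(seen)
    return total


def simulate(n=20000, seed=0):
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]         # always observed
    y = [2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]  # population mean is 1.0
    # MAR: the chance of observing y depends only on the observed covariate x.
    y_mar = [yi if rng.random() < (0.9 if xi == 0 else 0.3) else None
             for xi, yi in zip(x, y)]
    # MNAR: the chance of observing y depends on the value of y itself.
    y_mnar = [yi if rng.random() < (0.9 if yi < 1.0 else 0.3) else None
              for yi in y]
    return {
        "truth": statistics.fmean(y),
        "mar_adjusted": stratified_mean(x, y_mar),
        "mnar_adjusted": stratified_mean(x, y_mnar),
    }


result = simulate()
```

Under these assumptions `mar_adjusted` lands close to `truth`, while `mnar_adjusted` is pulled downward because large outcomes are preferentially missing within every stratum of `x`.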
Low-light Pedestrian Detection in Visible and Infrared Image Feeds: Issues and Challenges
Vachhani, Hrishikesh, Akilan, Thangarajah, Devmurari, Yash, Shaik, Nisharaff, Patel, Dhruvisha
Pedestrian detection has become a cornerstone of several vision-based applications in modern Intelligent Transportation Systems (ITS), security, and surveillance. It is the process of identifying human movements in input feeds from data acquisition devices, such as visual cameras and thermopile sensors, for semantic understanding of a scene. It is more significant than other forms of object detection since it addresses people's safety. Thus, it has stringent operational criteria, such as high detection accuracy and real-time performance, which are of paramount importance to the aforesaid smart systems. To address this, computer vision and machine learning researchers have introduced several methods that exploit technological advancements, as illustrated in Figure 1.
Knowledge Graph Representations to enhance Intensive Care Time-Series Predictions
Jain, Samyak, Burger, Manuel, Rätsch, Gunnar, Kuznetsova, Rita
Intensive Care Units (ICU) require comprehensive patient data integration for enhanced clinical outcome predictions, crucial for assessing patient conditions. Recent deep learning advances have utilized patient time series data, and fusion models have incorporated unstructured clinical reports, improving predictive performance. However, integrating established medical knowledge into these models has not yet been explored. The medical domain's data, rich in structural relationships, can be harnessed through knowledge graphs derived from clinical ontologies like the Unified Medical Language System (UMLS) for better predictions. Our proposed methodology integrates this knowledge with ICU data, improving clinical decision modeling. It combines graph representations with vital signs and clinical reports, enhancing performance, especially when data is missing. Additionally, our model includes an interpretability component to understand how knowledge graph nodes affect predictions.
A Federated Data Fusion-Based Prognostic Model for Applications with Multi-Stream Incomplete Signals
Industrial prognostics aims to predict the failure time of machines by utilizing their degradation signals. This is typically achieved by establishing a statistical learning model that maps the degradation signals of machines to their times-to-failure (TTFs) [1, 2]. As with many other statistical learning models, the implementation of prognostic models usually consists of two steps: model training and real-time monitoring (also known as model testing or deployment). Model training uses a historical dataset comprising the degradation signals and TTFs of failed machines to estimate the parameters of the prognostic model; real-time monitoring feeds the real-time degradation signals from a partially degraded onsite machine into the trained prognostic model to predict its TTF or TTF distribution. Most existing prognostic models assume that a historical dataset from a decent number of failed machines is available for model training [3, 4, 5, 6, 7]. In reality, however, the amount of historical data owned by a single organization (e.g., a company, a university lab, or a factory) might be too small to train a reliable prognostic model.
GICI-LIB: A GNSS/INS/Camera Integrated Navigation Library
Chi, Cheng, Zhang, Xin, Liu, Jiahui, Sun, Yulong, Zhang, Zihao, Zhan, Xingqun
Accurate navigation is essential for autonomous robots and vehicles. In recent years, the integration of the Global Navigation Satellite System (GNSS), Inertial Navigation System (INS), and camera has garnered considerable attention due to its robustness and high accuracy in diverse environments. However, leveraging the full capacity of GNSS is cumbersome because of the diverse choices of formulations, error models, satellite constellations, signal frequencies, and service types, which lead to different precision, robustness, and usage dependencies. To clarify the capacity of GNSS algorithms and accelerate the development of multi-sensor fusion algorithms that employ GNSS, we open-source the GNSS/INS/Camera Integration Library (GICI-LIB), together with detailed documentation and a comprehensive land vehicle dataset. A factor graph optimization-based multi-sensor fusion framework is established, which combines almost all GNSS measurement error sources by fully considering temporal and spatial correlations between measurements. The graph structure is designed for flexibility, making it easy to form any kind of integration algorithm. For illustration, Real-Time Kinematic (RTK), Precise Point Positioning (PPP), and four RTK-based algorithms from GICI-LIB are evaluated using our dataset and public datasets. Results confirm the potential of the GICI system to provide continuous precise navigation solutions in a wide spectrum of urban environments.
Neuro-Inspired Hierarchical Multimodal Learning
Xiao, Xiongye, Liu, Gengshuo, Gupta, Gaurav, Cao, Defu, Li, Shixuan, Li, Yaxing, Fang, Tianqing, Cheng, Mingxi, Bogdan, Paul
Integrating and processing information from various sources or modalities is critical for obtaining a comprehensive and accurate perception of the real world. Drawing inspiration from neuroscience, we develop the Information-Theoretic Hierarchical Perception (ITHP) model, which utilizes the concept of information bottleneck. Distinct from most traditional fusion models that aim to incorporate all modalities as input, our model designates the prime modality as input, while the remaining modalities act as detectors in the information pathway. Our proposed perception model focuses on constructing an effective and compact information flow by balancing the minimization of mutual information between the latent state and the input modal state against the maximization of mutual information between the latent states and the remaining modal states. This approach leads to compact latent state representations that retain relevant information while minimizing redundancy, thereby substantially enhancing the performance of downstream tasks. Experimental evaluations on both the MUStARD and CMU-MOSI datasets demonstrate that our model consistently distills crucial information in multimodal learning scenarios, outperforming state-of-the-art benchmarks. Remarkably, on the CMU-MOSI dataset, ITHP-DeBERTa surpasses human-level performance in the multimodal sentiment binary classification task across all evaluation metrics (i.e., Binary Accuracy, F1 Score, Mean Absolute Error, and Pearson Correlation).
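The trade-off described above has the shape of the standard information bottleneck Lagrangian. The abstract gives no formula, so the following is a sketch for a single level with one prime and one detector modality, and all symbols are assumed notation: $X_0$ is the prime (input) modality, $X_1$ a remaining modality acting as detector, $B$ the latent state, and $\beta > 0$ a trade-off weight.

```latex
\min_{p(b \mid x_0)} \; I(X_0; B) \;-\; \beta \, I(B; X_1)
```

The first term compresses the input into $B$, and the second rewards $B$ for retaining information about the detector modality. A hierarchical model would presumably chain such objectives across further modalities, which is not shown here.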
An Efficient Probabilistic Solution to Mapping Errors in LiDAR-Camera Fusion for Autonomous Vehicles
Shen, Dan, Zhang, Zhengming, Tian, Renran, Chen, Yaobin, Sherony, Rini
LiDAR-camera fusion is one of the core processes in the perception system of current automated driving systems. The typical sensor fusion process includes a series of coordinate transformation operations following system calibration. Although a significant amount of research has been done to improve fusion accuracy, there are still inherent data mapping errors in practice related to system synchronization offsets, vehicle vibrations, the small size of the target, and fast relative speeds. Moreover, increasingly complicated algorithms for improving fusion accuracy can overwhelm onboard computational resources, limiting actual implementation. This study proposes a novel and low-cost probabilistic LiDAR-camera fusion method to alleviate these inherent mapping errors in scene reconstruction. By calculating shape similarity using KL-divergence and applying a RANSAC-regression-based trajectory smoother, the effects of LiDAR-camera mapping errors are minimized in object localization and distance estimation. Designed experiments are conducted to demonstrate the robustness and effectiveness of the proposed strategy.
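The abstract does not specify how shapes are represented before the KL-divergence is computed. As a minimal sketch, assume each shape is summarized by a non-negative histogram (for example, of point counts per bin), which is normalized into a discrete distribution. The symmetrization and the smoothing constant `eps` are choices made here for illustration, not details taken from the paper.

```python
import math


def normalize(hist, eps=1e-9):
    """Turn a non-negative histogram into a probability distribution.

    eps (an assumed smoothing constant) keeps every bin strictly positive
    so the logarithm in the KL-divergence is always defined.
    """
    smoothed = [h + eps for h in hist]
    total = sum(smoothed)
    return [s / total for s in smoothed]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def shape_dissimilarity(hist_a, hist_b):
    """Symmetrized KL-divergence between two shape histograms (0 = identical)."""
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must use the same bins")
    p, q = normalize(hist_a), normalize(hist_b)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


same = shape_dissimilarity([1, 2, 3], [1, 2, 3])
different = shape_dissimilarity([1, 2, 3], [3, 2, 1])
```

Lower values indicate more similar shapes. Because the histograms are normalized first, the score depends on the relative distribution across bins rather than on the raw counts.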
Enhanced Information Extraction from Cylindrical Visual-Tactile Sensors via Image Fusion
Li, Zilan, Zou, Zhibin, Xu, Weiliang, Zhou, Yuanzhi, Zhou, Guoyuan, Huang, Xuan, Li, Xinming
Vision-based tactile sensors equipped with planar contact structures acquire the shape, force, and motion states of objects in contact. The limited planar contact area presents a challenge in acquiring information about larger target objects. In contrast, vision-based tactile sensors with cylindrical contact structures can extend the contact area by rolling, acquiring tactile information over an area that exceeds the sensing projection area of a single contact. However, the tactile data acquired by cylindrical structures does not consistently correspond to the same depth level. Therefore, stitching and analyzing the data in an extended contact area is a challenging problem. In this work, we propose an image fusion method based on cylindrical vision-based tactile sensors. The method takes advantage of the changing characteristics of the contact depth of cylindrical structures, extracts the effective information of different contact depths in the frequency domain, and performs differential fusion for the information characteristics. The results show that, when the contacted object covers an area larger than a single sensing region, the images fused with our proposed method have higher information and structural similarity than those produced by stitching based on motion distance sampling. The method is also robust to sampling time. We complement this method with a deep neural network to illustrate its potential for fusing and recognizing object contact information using cylindrical vision-based tactile sensors.
Multimodal Stress Detection Using Facial Landmarks and Biometric Signals
Hosseini, Majid, Bodaghi, Morteza, Bhupatiraju, Ravi Teja, Maida, Anthony, Gottumukkala, Raju
The development of various sensing technologies is improving measurements of stress and the well-being of individuals. Although progress has been made with single signal modalities like wearables and facial emotion recognition, integrating multiple modalities provides a more comprehensive understanding of stress, given that stress manifests differently across different people. Multi-modal learning aims to capitalize on the strength of each modality rather than relying on a single signal. Given the complexity of processing and integrating high-dimensional data from limited subjects, more research is needed. Numerous research efforts have focused on fusing stress and emotion signals at an early stage, e.g., feature-level fusion using basic machine learning methods and 1D-CNN methods. This paper proposes a multi-modal learning approach for stress detection that integrates facial landmarks and biometric signals. We test this multi-modal integration with various early-fusion and late-fusion techniques that combine a 1D-CNN model for biometric signals with a 2D-CNN model for facial landmarks. We evaluate these architectures with a rigorous test of the models' generalizability, the leave-one-subject-out mechanism, i.e., all samples related to a single subject are held out when training the model. Our findings show that late-fusion achieved 94.39% accuracy, and early-fusion surpassed it with a 98.38% accuracy rate. This research contributes valuable insights into enhancing stress detection through a multi-modal approach.
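The leave-one-subject-out protocol can be sketched as a simple index splitter. The function name `leave_one_subject_out` and the example subject labels are hypothetical; the only assumption is that each sample carries a subject identifier.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject.

    subject_ids: one identifier per sample, in sample order.
    """
    for held_out in sorted(set(subject_ids)):
        # Every sample of the held-out subject goes to the test fold,
        # so the model is never trained on the person it is evaluated on.
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# Hypothetical example: five samples recorded from three subjects.
folds = list(leave_one_subject_out(["a", "a", "b", "c", "b"]))
```

Splitting by subject rather than by sample avoids leaking a person's idiosyncratic signal patterns from the training set into the test set, which is why this is a stricter check of generalizability than a random split.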