Information Fusion
A Review of Vision-Language Models and their Performance on the Hateful Memes Challenge
Zhao, Bryan, Zhang, Andrew, Watson, Blake, Kearney, Gillian, Dale, Isaac
Moderation of social media content is currently a highly manual task, yet there is too much content posted daily to do so effectively. With the advent of a number of multimodal models, there is the potential to reduce the amount of manual labor for this task. In this work, we aim to explore different models and determine what is most effective for the Hateful Memes Challenge, a challenge by Meta designed to further machine learning research in content moderation. Specifically, we explore the differences between early fusion and late fusion models in classifying multimodal memes containing text and images. We first implement a baseline using unimodal models for text and images separately using BERT and ResNet-152, respectively. The outputs from these unimodal models were then concatenated together to create a late fusion model. In terms of early fusion models, we implement ConcatBERT, VisualBERT, ViLT, CLIP, and BridgeTower. It was found that late fusion performed significantly worse than early fusion models, with the best performing model being CLIP which achieved an AUROC of 70.06. The code for this work is available at https://github.com/bzhao18/CS-7643-Project.
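The late fusion baseline described above, in which the outputs of the unimodal text and image models are concatenated, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `late_fusion_score`, the logistic scoring layer, and all feature values and weights are hypothetical, and the feature vectors merely stand in for BERT and ResNet-152 outputs.

```python
import math


def late_fusion_score(text_features, image_features, weights, bias=0.0):
    """Concatenate unimodal features and score them with a logistic layer.

    The logistic layer is an assumed classifier head; the abstract only
    states that the unimodal outputs are concatenated.
    """
    fused = list(text_features) + list(image_features)  # concatenation step
    if len(fused) != len(weights):
        raise ValueError("weights must match the fused feature length")
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-logit))  # probability of the positive class


# Hypothetical 2-d text and 2-d image features with made-up weights.
score = late_fusion_score([0.5, -1.0], [2.0, 0.0], [1.0, 1.0, 0.5, 0.25])
```

Here each modality is encoded independently and the two only interact in the final scoring layer, which is the defining property of late fusion as the abstract uses the term.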
A sensor fusion approach for improving implementation speed and accuracy of RTAB-Map algorithm based indoor 3D mapping
Phan, Hoang-Anh, Nguyen, Phuc Vinh, Khuat, Thu Hang Thi, Van, Hieu Dang, Tran, Dong Huu Quoc, Dang, Bao Lam, Bui, Tung Thanh, Thanh, Van Nguyen Thi, Duc, Trinh Chu
In recent years, 3D mapping for indoor environments has undergone considerable research and improvement because of its effective applications in various fields, including robotics, autonomous navigation, and virtual reality. Building an accurate 3D map of an indoor environment is challenging due to the complex nature of indoor spaces, the problem of real-time embedding, and the positioning errors of the robot system. This study proposes a method to improve the accuracy, speed, and quality of 3D indoor mapping by fusing data from the Inertial Measurement Unit (IMU) of the Intel Realsense D435i camera, the Ultrasonic-based Indoor Positioning System (IPS), and the encoder of the robot's wheel using the extended Kalman filter (EKF) algorithm. The merged data is processed using the Real-Time Appearance-Based Mapping (RTAB-Map) algorithm, with the processing frequency updated in sync with the position frequency of the IPS device. The results suggest that fusing IMU and IPS data significantly improves the accuracy, mapping time, and quality of 3D maps. Our study highlights the proposed method's potential to improve indoor mapping in various fields, indicating that the fusion of multiple data sources can be a valuable tool in creating high-quality 3D indoor maps.
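The predict/update cycle behind this kind of sensor fusion can be illustrated with a one-dimensional linear Kalman filter, which is what the EKF reduces to when the motion and measurement models are linear. This is only a sketch under stated assumptions: the state is a single position coordinate, the wheel encoder supplies a displacement `u`, the IPS supplies a direct position measurement `z`, and the noise variances `q` and `r` are made-up values. It is not the filter design used in the paper.

```python
def kf_predict(x, p, u, q):
    """Propagate position x and variance p using an odometry displacement u."""
    return x + u, p + q  # uncertainty grows by the process noise q


def kf_update(x, p, z, r):
    """Correct the prediction with a position measurement z of variance r."""
    k = p / (p + r)  # Kalman gain: how much to trust the measurement
    return x + k * (z - x), (1.0 - k) * p


# Hypothetical numbers: start at 0 m with variance 1.0.
x, p = 0.0, 1.0
x, p = kf_predict(x, p, u=1.0, q=0.5)  # encoder reports 1.0 m of travel
x, p = kf_update(x, p, z=1.2, r=0.5)   # IPS reports a position of 1.2 m
```

After the update, the estimate lies between the odometry prediction and the IPS reading, weighted by their variances, and the posterior variance is smaller than either source alone would give.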
On the Fusion Strategies for Federated Decision Making
Kayaalp, Mert, Inan, Yunus, Koivunen, Visa, Telatar, Emre, Sayed, Ali H.
We consider the problem of information aggregation in federated decision making, where a group of agents collaborate to infer the underlying state of nature without sharing their private data with the central processor or each other. We analyze the non-Bayesian social learning strategy in which agents incorporate their individual observations into their opinions (i.e., soft-decisions) with Bayes rule, and the central processor aggregates these opinions by arithmetic or geometric averaging. Building on our previous work, we establish that both pooling strategies result in an asymptotic normality characterization of the system, which, for instance, can be utilized to derive approximate expressions for the error probability. We verify the theoretical findings with simulations and compare both strategies. Figure 1: Data types at the edge devices can be highly heterogeneous.
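The two pooling rules mentioned above can be made concrete with a small sketch. The function names and the numeric beliefs are illustrative assumptions, and the code covers a single aggregation round only. It assumes every belief is a strictly positive probability vector over the same hypotheses.

```python
import math


def bayes_update(belief, likelihoods):
    """Local step: posterior is proportional to prior times likelihood."""
    unnorm = [b * l for b, l in zip(belief, likelihoods)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def arithmetic_pool(beliefs):
    """Central step, option 1: entrywise arithmetic mean of the beliefs."""
    n = len(beliefs)
    return [sum(b[k] for b in beliefs) / n for k in range(len(beliefs[0]))]


def geometric_pool(beliefs):
    """Central step, option 2: normalized entrywise geometric mean."""
    n = len(beliefs)
    # Average in the log domain for numerical stability, then renormalize.
    logs = [sum(math.log(b[k]) for b in beliefs) / n
            for k in range(len(beliefs[0]))]
    shift = max(logs)
    unnorm = [math.exp(v - shift) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Two hypothetical agents holding beliefs over two hypotheses.
agent_beliefs = [[0.8, 0.2], [0.5, 0.5]]
arith = arithmetic_pool(agent_beliefs)  # approximately [0.65, 0.35]
geom = geometric_pool(agent_beliefs)    # approximately [2/3, 1/3]
```

Geometric pooling multiplies evidence, so a single agent assigning near-zero probability to a hypothesis can suppress it for the whole group, whereas arithmetic pooling is more forgiving of such outliers.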
Empowering Language Model with Guided Knowledge Fusion for Biomedical Document Re-ranking
Gupta, Deepak, Demner-Fushman, Dina
Pre-trained language models (PLMs) have proven to be effective for the document re-ranking task. However, they lack the ability to fully interpret the semantics of biomedical and health-care queries and often rely on simplistic patterns for retrieving documents. To address this challenge, we propose an approach that integrates knowledge with the PLMs to guide the model toward effectively capturing information from external sources and retrieving the correct documents. We performed comprehensive experiments on two biomedical and open-domain datasets that show that our approach significantly improves on vanilla PLMs and other existing approaches for the document re-ranking task.
MIXER: Multiattribute, Multiway Fusion of Uncertain Pairwise Affinities
Lusk, Parker C., Fathian, Kaveh, How, Jonathan P.
We present a multiway fusion algorithm capable of directly processing uncertain pairwise affinities. In contrast to existing works that require initial pairwise associations, our MIXER algorithm improves accuracy by leveraging the additional information provided by pairwise affinities. Our main contribution is a multiway fusion formulation that is particularly suited to processing non-binary affinities and a novel continuous relaxation whose solutions are guaranteed to be binary, thus avoiding the typical, but potentially problematic, solution binarization steps that may cause infeasibility. A crucial insight of our formulation is that it allows for three modes of association: non-match, undecided, and match. Exploiting this insight allows fusion to be delayed for some data pairs until more information is available, which is an effective feature for fusion of data with multiple attributes/information sources. We evaluate MIXER on typical synthetic data and benchmark datasets and show increased accuracy against the state of the art in multiway matching, especially in noisy regimes with low observation redundancy. Additionally, we collect RGB data of cars in a parking lot to demonstrate MIXER's ability to fuse data having multiple attributes (color, visual appearance, and bounding box). On this challenging dataset, MIXER achieves 74% F1 accuracy and is 49x faster than the next best algorithm, which has 42% accuracy. Open source code is available at https://github.com/mit-acl/mixer.
Fusion for Visual-Infrared Person ReID in Real-World Surveillance Using Corrupted Multimodal Data
Josi, Arthur, Alehdaghi, Mahdi, Cruz, Rafael M. O., Granger, Eric
Visible-infrared person re-identification (V-I ReID) seeks to match images of individuals captured over a distributed network of RGB and IR cameras. The task is challenging due to the significant differences between V and I modalities, especially under real-world conditions, where images are corrupted by, e.g., blur, noise, and weather. Indeed, state-of-the-art V-I ReID models cannot leverage corrupted modality information to sustain a high level of accuracy. In this paper, we propose an efficient model for multimodal V-I ReID -- named Multimodal Middle Stream Fusion (MMSF) -- that preserves modality-specific knowledge for improved robustness to corrupted multimodal images. In addition, three state-of-the-art attention-based multimodal fusion models are adapted to address corrupted multimodal data in V-I ReID, allowing the importance of each modality to be balanced dynamically. Recently, evaluation protocols have been proposed to assess the robustness of ReID models under challenging real-world scenarios. However, these protocols are limited to unimodal V settings. For realistic evaluation of multimodal (and cross-modal) V-I person ReID models, we propose new challenging corrupted datasets for scenarios where V and I cameras are co-located (CL) and not co-located (NCL). Finally, the benefits of our Masking and Local Multimodal Data Augmentation (ML-MDA) strategy are explored to improve the robustness of ReID models to multimodal corruption. Our experiments on clean and corrupted versions of the SYSU-MM01, RegDB, and ThermalWORLD datasets indicate which multimodal V-I ReID models are more likely to perform well in real-world operational conditions. In particular, our ML-MDA is an important strategy for a V-I person ReID system to sustain high accuracy and robustness when processing corrupted multimodal images. Also, our multimodal ReID model MMSF outperforms every method under CL and NCL camera scenarios.
Multimodal Affective States Recognition Based on Multiscale CNNs and Biologically Inspired Decision Fusion Model
Zhao, Yuxuan, Cao, Xinyan, Lin, Jinlong, Yu, Dunshan, Cao, Xixin
There has been encouraging progress in recent years in affective states recognition models based on single-modality signals such as electroencephalogram (EEG) signals or peripheral physiological signals. However, multimodal physiological signals-based affective states recognition methods have not been thoroughly exploited yet. Here we propose Multiscale Convolutional Neural Networks (Multiscale CNNs) and a biologically inspired decision fusion model for multimodal affective states recognition. Firstly, the raw signals are pre-processed with baseline signals. Then, the High Scale CNN and Low Scale CNN in Multiscale CNNs are utilized to predict the probability of affective states output for EEG and each peripheral physiological signal respectively. Finally, the fusion model calculates the reliability of each single-modality signal by the Euclidean distance between various class labels and the classification probability from Multiscale CNNs, and the decision is made by the more reliable modality information while the information from other modalities is retained. We use this model to classify four affective states from the arousal valence plane in the DEAP and AMIGOS datasets. The results show that the fusion model improves the accuracy of affective states recognition significantly compared with the result on single-modality signals, and the recognition accuracy of the fusion result achieves 98.52% and 99.89% in the DEAP and AMIGOS datasets respectively.
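One possible reading of the reliability rule described above is sketched below. It is an interpretation, not the authors' model: reliability is taken to be the Euclidean distance from a modality's probability vector to its nearest one-hot class label (smaller means more reliable), and the fused decision is the class predicted by the most reliable modality. The function names and probability values are hypothetical, and the part of the model that retains information from the other modalities is not represented.

```python
import math


def distance_to_nearest_label(probs):
    """Return (distance, class index) for the closest one-hot class label."""
    best = None
    for k in range(len(probs)):
        # Squared distance between probs and the one-hot vector for class k.
        sq = sum((p - (1.0 if i == k else 0.0)) ** 2
                 for i, p in enumerate(probs))
        d = math.sqrt(sq)
        if best is None or d < best[0]:
            best = (d, k)
    return best


def fuse_decisions(modality_probs):
    """Pick the class predicted by the modality closest to a one-hot label."""
    scored = {name: distance_to_nearest_label(p)
              for name, p in modality_probs.items()}
    name = min(scored, key=lambda n: scored[n][0])
    return name, scored[name][1]


# Hypothetical four-class outputs from two single-modality classifiers.
outputs = {
    "eeg": [0.4, 0.3, 0.2, 0.1],
    "peripheral": [0.05, 0.85, 0.05, 0.05],
}
chosen_modality, chosen_class = fuse_decisions(outputs)
```

In this example the peripheral classifier is far more confident than the EEG classifier, so its prediction determines the fused decision.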
A Cooperative Perception System Robust to Localization Errors
Song, Zhiying, Wen, Fuxi, Zhang, Hailiang, Li, Jun
Cooperative perception is challenging for safety-critical autonomous driving applications. The errors in the shared position and pose cause an inaccurate relative transform estimation and disrupt the robust mapping of the Ego vehicle. We propose a distributed object-level cooperative perception system called OptiMatch, in which the detected 3D bounding boxes and local state information are shared between the connected vehicles. To correct the noisy relative transform, the local measurements of both connected vehicles (bounding boxes) are utilized, and an optimal transport theory-based algorithm is developed to filter out those objects jointly detected by the vehicles along with their correspondence, constructing an associated co-visible set. A correction transform is estimated from the matched object pairs and further applied to the noisy relative transform, followed by global fusion and dynamic mapping. Experimental results show that robust performance is achieved for different levels of location and heading errors, and the proposed framework outperforms the state-of-the-art benchmark fusion schemes, including early, late, and intermediate fusion, on average precision by a large margin when location and/or heading errors occur.
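The step of estimating a correction transform from matched object pairs can be illustrated with a closed-form least-squares rigid alignment in 2D. This sketch makes several assumptions that are not in the abstract: objects are reduced to 2D box centers, the correspondences are already known (the optimal transport matching is not shown), and the function name `estimate_rigid_2d` is hypothetical.

```python
import math


def estimate_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping src onto dst.

    src and dst are equal-length lists of matched (x, y) points.
    """
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(p[0] for p in dst) / n
    dy = sum(p[1] for p in dst) / n
    cross = 0.0  # accumulates the sine component of the optimal rotation
    dot = 0.0    # accumulates the cosine component of the optimal rotation
    for (ax, ay), (bx, by) in zip(src, dst):
        ax, ay, bx, by = ax - sx, ay - sy, bx - dx, by - dy
        cross += ax * by - ay * bx
        dot += ax * bx + ay * by
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    # Translation aligns the rotated source centroid with the target centroid.
    tx = dx - (c * sx - s * sy)
    ty = dy - (s * sx + c * sy)
    return theta, (tx, ty)


# Hypothetical matched centers: dst is src rotated by 90 degrees, then
# shifted by (2, 3).
src_pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
dst_pts = [(2.0, 3.0), (2.0, 4.0), (1.0, 3.0)]
theta, (tx, ty) = estimate_rigid_2d(src_pts, dst_pts)
```

With noise-free correspondences the recovered angle and translation match the transform used to generate the data; with noisy detections the same formula returns the least-squares best fit.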
Data-driven Knowledge Fusion for Deep Multi-instance Learning
Zhang, Yu-Xuan, Zhou, Zhengchun, He, Xingxing, Adhikary, Avik Ranjan, Dutta, Bapi
Multi-instance learning (MIL) is a widely applied technique in practical applications that involve complex data structures. MIL can be broadly categorized into two types: traditional methods and those based on deep learning. These approaches have yielded significant results, especially with regard to their problem-solving strategies and experimental validation, providing valuable insights for researchers in the MIL field. However, a considerable amount of knowledge is often trapped within the algorithm, leading to subsequent MIL algorithms that rely solely on the model's data fitting to predict unlabeled samples. This results in a significant loss of knowledge and impedes the development of more intelligent models. In this paper, we propose a novel data-driven knowledge fusion for deep multi-instance learning (DKMIL) algorithm. DKMIL adopts a completely different idea from existing deep MIL methods by analyzing the decision-making of key samples in the data set (referred to as data-driven) and using the knowledge fusion module designed to extract valuable information from these samples to assist the model's training. In other words, this module serves as a new interface between data and the model, providing strong scalability and enabling the use of prior knowledge from existing algorithms to enhance the learning ability of the model. Furthermore, to adapt the downstream modules of the model to more knowledge-enriched features extracted from the data-driven knowledge fusion module, we propose a two-level attention module that gradually learns shallow- and deep-level features of the samples to achieve more effective classification. We will prove the scalability of the knowledge fusion module while also verifying the efficacy of the proposed architecture by conducting experiments on 38 data sets across 6 categories.
Missing Modality Robustness in Semi-Supervised Multi-Modal Semantic Segmentation
Maheshwari, Harsh, Liu, Yen-Cheng, Kira, Zsolt
Using multiple spatial modalities has been proven helpful in improving semantic segmentation performance. However, there are several real-world challenges that have yet to be addressed: (a) improving label efficiency and (b) enhancing robustness in realistic scenarios where modalities are missing at test time. To address these challenges, we first propose a simple yet efficient multi-modal fusion mechanism, Linear Fusion, that performs better than the state-of-the-art multi-modal models even with limited supervision. Second, we propose M3L: Multi-modal Teacher for Masked Modality Learning, a semi-supervised framework that not only improves the multi-modal performance but also makes the model robust to the realistic missing modality scenario using unlabeled data. We create the first benchmark for semi-supervised multi-modal semantic segmentation and also report the robustness to missing modalities. Our proposal shows an absolute improvement of up to 10% on robust mIoU above the most competitive baselines. Our code is available at https://github.com/harshm121/M3L