Information Fusion
A Generalized Multi-Modal Fusion Detection Framework
Cui, Leichao, Li, Xiuxian, Meng, Min, Mo, Xiaoyu
LiDAR point clouds have become the most common data source in autonomous driving. However, due to the sparsity of point clouds, accurate and reliable detection cannot be achieved in specific scenarios. Because of their complementarity with point clouds, images are getting increasing attention. Although with some success, existing fusion methods either perform hard fusion or do not fuse in a direct manner. In this paper, we propose a generic 3D detection framework called MMFusion, using multi-modal features. The framework aims to achieve accurate fusion between LiDAR and images to improve 3D detection in complex scenes. Our framework consists of two separate streams: the LiDAR stream and the camera stream, which can be compatible with any single-modal feature extraction network. The Voxel Local Perception Module in the LiDAR stream enhances local feature representation, and then the Multi-modal Feature Fusion Module selectively combines feature output from different streams to achieve better fusion. Extensive experiments have shown that our framework not only outperforms existing benchmarks but also improves their detection, especially for detecting cyclists and pedestrians on KITTI benchmarks, with strong robustness and generalization capabilities. Hopefully, our work will stimulate more research into multi-modal fusion for autonomous driving tasks.
High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning
Liang, Paul Pu, Lyu, Yiwei, Fan, Xiang, Tsaw, Jeffrey, Liu, Yudong, Mo, Shentong, Yogatama, Dani, Morency, Louis-Philippe, Salakhutdinov, Ruslan
Many real-world problems are inherently multimodal, from spoken language, gestures, and paralinguistics humans use to communicate, to force, proprioception, and visual sensors on robots. While there has been an explosion of interest in multimodal learning, these methods are focused on a small set of modalities primarily in language, vision, and audio. In order to accelerate generalization towards diverse and understudied modalities, this paper studies efficient representation learning for high-modality scenarios involving a large set of diverse modalities. Since adding new models for every new modality becomes prohibitively expensive, a critical technical challenge is heterogeneity quantification: how can we measure which modalities encode similar information and interactions in order to permit parameter sharing with previous modalities? This paper proposes two new information theoretic metrics for heterogeneity quantification: (1) modality heterogeneity studies how similar 2 modalities {X1,X2} are by measuring how much information can be transferred from X1 to X2, while (2) interaction heterogeneity studies how similarly pairs of modalities {X1,X2}, {X3,X4} interact by measuring how much information can be transferred from fusing {X1,X2} to {X3,X4}. We show the importance of these 2 proposed metrics as a way to automatically prioritize the fusion of modalities that contain unique information or interactions. The result is a single model, HighMMT, that scales up to 10 modalities (text, image, audio, video, sensors, proprioception, speech, time-series, sets, and tables) and 15 tasks from 5 research areas. Not only does HighMMT outperform prior methods on the tradeoff between performance and efficiency, it also demonstrates a crucial scaling behavior: performance continues to improve with each modality added, and it transfers to entirely new modalities and tasks during fine-tuning.
MultiZoo & MultiBench: A Standardized Toolkit for Multimodal Deep Learning
Liang, Paul Pu, Lyu, Yiwei, Fan, Xiang, Agarwal, Arav, Cheng, Yun, Morency, Louis-Philippe, Salakhutdinov, Ruslan
Learning multimodal representations involves integrating information from multiple heterogeneous sources of data. In order to accelerate progress towards understudied modalities and tasks while ensuring real-world robustness, we release MultiZoo, a public toolkit consisting of standardized implementations of > 20 core multimodal algorithms and MultiBench, a large-scale benchmark spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas. Together, these provide an automated end-to-end machine learning pipeline that simplifies and standardizes data loading, experimental setup, and model evaluation. To enable holistic evaluation, we offer a comprehensive methodology to assess (1) generalization, (2) time and space complexity, and (3) modality robustness. MultiBench paves the way towards a better understanding of the capabilities and limitations of multimodal models, while ensuring ease of use, accessibility, and reproducibility. Our toolkits are publicly available, will be regularly updated, and welcome inputs from the community.
Multi-perspective Information Fusion Res2Net with RandomSpecmix for Fake Speech Detection
Dong, Shunbo, Xue, Jun, Fan, Cunhang, Zhu, Kang, Chen, Yujie, Lv, Zhao
In this paper, we propose the multi-perspective information fusion (MPIF) Res2Net with random Specmix for fake speech detection (FSD). The main purpose of this system is to improve the model's ability to learn precise forgery information for FSD task in low-quality scenarios. The task of random Specmix, a data augmentation, is to improve the generalization ability of the model and enhance the model's ability to locate discriminative information. Specmix cuts and pastes the frequency dimension information of the spectrogram in the same batch of samples without introducing other data, which helps the model to locate the really useful information. At the same time, we randomly select samples for augmentation to reduce the impact of data augmentation directly changing all the data. Once the purpose of helping the model to locate information is achieved, it is also important to reduce unnecessary information. The role of MPIF-Res2Net is to reduce redundant interference information. Deceptive information from a single perspective is always similar, so the model learning this similar information will produce redundant spoofing clues and interfere with truly discriminative information. The proposed MPIF-Res2Net fuses information from different perspectives, making the information learned by the model more diverse, thereby reducing the redundancy caused by similar information and avoiding interference with the learning of discriminative information. The results on the ASVspoof 2021 LA dataset demonstrate the effectiveness of our proposed method, achieving EER and min-tDCF of 3.29% and 0.2557, respectively.
A pose and shear-based tactile robotic system for object tracking, surface following and object pushing
Tactile perception is a crucial sensing modality in robotics, particularly in scenarios that require precise manipulation and safe interaction with other objects. Previous research in this area has focused extensively on tactile perception of contact poses as this is an important capability needed for tasks such as traversing an object's surface or edge, manipulating an object, or pushing an object along a predetermined path. Another important capability needed for tasks such as object tracking and manipulation is estimation of post-contact shear but this has received much less attention. Indeed, post-contact shear has often been considered a "nuisance variable" and is removed if possible because it can have an adverse effect on other types of tactile perception such as contact pose estimation. This paper proposes a tactile robotic system that can simultaneously estimate both the contact pose and post-contact shear, and use this information to control its interaction with other objects. Moreover, our new system is capable of interacting with other objects in a smooth and continuous manner, unlike the stepwise, position-controlled systems we have used in the past. We demonstrate the capabilities of our new system using several different controller configurations, on tasks including object tracking, surface following, single-arm object pushing, and dual-arm object pushing.
Enhanced multi-fidelity modelling for digital twin and uncertainty quantification
Desai, AS, N, Navaneeth, Adhikari, S, Chakraborty, S
The increasing significance of digital twin technology across engineering and industrial domains, such as aerospace, infrastructure, and automotive, is undeniable. However, the lack of detailed application-specific information poses challenges to its seamless implementation in practical systems. Data-driven models play a crucial role in digital twins, enabling real-time updates and predictions by leveraging data and computational models. Nonetheless, the fidelity of available data and the scarcity of accurate sensor data often hinder the efficient learning of surrogate models, which serve as the connection between physical systems and digital twin models. To address this challenge, we propose a novel framework that begins by developing a robust multi-fidelity surrogate model, subsequently applied for tracking digital twin systems. Our framework integrates polynomial correlated function expansion (PCFE) with the Gaussian process (GP) to create an effective surrogate model called H-PCFE. Going a step further, we introduce deep-HPCFE, a cascading arrangement of models with different fidelities, utilizing nonlinear auto-regression schemes. These auto-regressive schemes effectively address the issue of erroneous predictions from low-fidelity models by incorporating space-dependent cross-correlations among the models. To validate the efficacy of the multi-fidelity framework, we first assess its performance in uncertainty quantification using benchmark numerical examples. Subsequently, we demonstrate its applicability in the context of digital twin systems.
BotanicGarden: A high-quality and large-scale robot navigation dataset in challenging natural environments
Liu, Yuanzhi, Fu, Yujia, Qin, Minghui, Xu, Yufeng, Xu, Baoxin, Chen, Fengdong, Goossens, Bart, Yu, Hongwei, Liu, Chun, Chen, Long, Tao, Wei, Zhao, Hui
The rapid developments of mobile robotics and autonomous navigation over the years are largely empowered by public datasets for testing and upgrading, such as SLAM and localization tasks. Impressive demos and benchmark results have arisen, indicating the establishment of a mature technical framework. However, from the view point of real-world deployments, there are still critical defects of robustness in challenging environments, especially in large-scale, GNSS-denied, textural-monotonous, and unstructured scenarios. To meet the pressing validation demands in such scope, we build a novel challenging robot navigation dataset in a large botanic garden of more than 48000m2. Comprehensive sensors are employed, including high-res/rate stereo Gray&RGB cameras, rotational and forward 3D LiDARs, and low-cost and industrial-grade IMUs, all of which are well calibrated and accurately hardware-synchronized. An all-terrain wheeled robot is configured to mount the sensor suite and provide odometry data. A total of 32 long and short sequences of 2.3 million images are collected, covering scenes of thick woods, riversides, narrow paths, bridges, and grasslands that rarely appeared in previous resources. Excitedly, both highly-accurate ego-motions and 3D map ground truth are provided, along with fine-annotated vision semantics. Our goal is to contribute a high-quality dataset to advance robot navigation and sensor fusion research to a higher level.
DIAS–Earth Environment Data Integration and Analysis System
Our group has been developing and operating a platform to acquire, archive, and manage various data related to the Earth's environment to make it available to researchers across a wide range of fields. Development of this system began in the 1980s to receive, archive, and distribute Asian satellite image data. Currently, the system covers a variety of data including weather, climate change, disaster prevention, biodiversity, health, and agriculture. Today, the Data Integration and Analysis System (DIASa) is a large-scale analysis platform with huge storage and more than 10,000 registered users (half of them in Japan and the other half primarily in Asia). As shown in the accompanying figure, users can easily use data collected by the common collection API through the common use API, and they can operate services at the application layer.
Multi-pass Training and Cross-information Fusion for Low-resource End-to-end Accented Speech Recognition
Wang, Xuefei, Long, Yanhua, Li, Yijie, Wei, Haoran
Low-resource accented speech recognition is one of the important challenges faced by current ASR technology in practical applications. In this study, we propose a Conformer-based architecture, called Aformer, to leverage both the acoustic information from large non-accented and limited accented training data. Specifically, a general encoder and an accent encoder are designed in the Aformer to extract complementary acoustic information. Moreover, we propose to train the Aformer in a multi-pass manner, and investigate three cross-information fusion methods to effectively combine the information from both general and accent encoders. All experiments are conducted on both the accented English and Mandarin ASR tasks. Results show that our proposed methods outperform the strong Conformer baseline by relative 10.2% to 24.5% word/character error rate reduction on six in-domain and out-of-domain accented test sets.
Multimodal Fusion Transformer for Remote Sensing Image Classification
Roy, Swalpa Kumar, Deria, Ankur, Hong, Danfeng, Rasti, Behnood, Plaza, Antonio, Chanussot, Jocelyn
Vision transformers (ViTs) have been trending in image classification tasks due to their promising performance when compared to convolutional neural networks (CNNs). As a result, many researchers have tried to incorporate ViTs in hyperspectral image (HSI) classification tasks. To achieve satisfactory performance, close to that of CNNs, transformers need fewer parameters. ViTs and other similar transformers use an external classification (CLS) token which is randomly initialized and often fails to generalize well, whereas other sources of multimodal datasets, such as light detection and ranging (LiDAR) offer the potential to improve these models by means of a CLS. In this paper, we introduce a new multimodal fusion transformer (MFT) network which comprises a multihead cross patch attention (mCrossPA) for HSI land-cover classification. Our mCrossPA utilizes other sources of complementary information in addition to the HSI in the transformer encoder to achieve better generalization. The concept of tokenization is used to generate CLS and HSI patch tokens, helping to learn a {distinctive representation} in a reduced and hierarchical feature space. Extensive experiments are carried out on {widely used benchmark} datasets {i.e.,} the University of Houston, Trento, University of Southern Mississippi Gulfpark (MUUFL), and Augsburg. We compare the results of the proposed MFT model with other state-of-the-art transformers, classical CNNs, and conventional classifiers models. The superior performance achieved by the proposed model is due to the use of multihead cross patch attention. The source code will be made available publicly at \url{https://github.com/AnkurDeria/MFT}.}