XFlow: 1D-2D Cross-modal Deep Neural Networks for Audiovisual Classification

Cangea, Cătălina, Veličković, Petar, Liò, Pietro

arXiv.org Machine Learning 

Abstract-- We propose two multimodal deep learning architectures that allow for cross-modal dataflow (XFlow) between the feature extractors, thereby extracting more interpretable features and obtaining a better representation than through unimodal learning, for the same amount of training data. These models can usefully exploit correlations between audio and visual data, which have different dimensionalities and are therefore nontrivial to exchange. Our work improves on existing multimodal deep learning methodologies in two essential ways: (1) it presents a novel method for performing cross-modality (before features are learned from individual modalities) and (2) it extends the previously proposed cross-connections [1], which only transfer information between streams that process compatible data. Both cross-modal architectures outperformed their baselines (by up to 7.5%) when evaluated on the AVletters dataset.

I. INTRODUCTION

An interesting extension of unimodal learning consists of deep models that "fuse" several modalities (for example, sound, image or text) and thereby learn a shared representation, outperforming previous architectures on discriminative tasks.
