XFlow: 1D-2D Cross-modal Deep Neural Networks for Audiovisual Classification
Cătălina Cangea, Petar Veličković, Pietro Liò
Abstract-- We propose two multimodal deep learning architectures that allow for cross-modal dataflow (XFlow) between the feature extractors, thereby extracting more interpretable features and obtaining a better representation than through unimodal learning, for the same amount of training data. These models can usefully exploit correlations between audio and visual data, which have different dimensionalities and are therefore nontrivially exchangeable. Our work improves on existing multimodal deep learning methodologies in two essential ways: (1) it presents a novel method for performing cross-modality (before features are learned from individual modalities) and (2) it extends the previously proposed cross-connections [1], which only transfer information between streams that process compatible data. Both cross-modal architectures outperformed their baselines (by up to 7.5%) when evaluated on the AVletters dataset.
I. INTRODUCTION
An interesting extension of unimodal learning consists of deep models which "fuse" several modalities (for example, sound, image or text) and thereby learn a shared representation, outperforming previous architectures on discriminative tasks.
Sep-2-2017
- Country:
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Genre:
- Research Report > Promising Solution (0.34)
- Technology: