XFlow: 1D-2D Cross-modal Deep Neural Networks for Audiovisual Classification

Cătălina Cangea, Petar Veličković, Pietro Liò

arXiv.org Machine Learning 

We propose two multimodal deep learning architectures that allow for cross-modal dataflow (XFlow) between the feature extractors, thereby extracting more interpretable features and obtaining a better representation than through unimodal learning, for the same amount of training data. These models can usefully exploit correlations between audio and visual data, which have different dimensionalities and are therefore not trivially exchangeable. Our work improves on existing multimodal deep learning methodologies in two essential ways: (1) it presents a novel method for performing cross-modal information exchange before features are learned from the individual modalities, and (2) it extends the previously proposed cross-connections, which only transfer information between streams that process compatible data. Both cross-modal architectures outperformed their baselines (by up to 7.5%) when evaluated on the AVletters dataset.
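To make the core idea concrete, the sketch below shows one 2D-to-1D and one 1D-to-2D cross-connection between an audio (1D convolutional) stream and a visual (2D convolutional) stream, resolving the dimensionality mismatch with dense projections and reshapes. This is an illustrative assumption, not the authors' released architecture: the use of Keras, the layer sizes, and the input shapes are all hypothetical choices made for the example.

```python
# Minimal sketch of 1D<->2D cross-connections between an audio stream and a
# visual stream. Shapes and layer sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

# Hypothetical inputs: 26 audio feature frames x 13 coefficients,
# and 60x80 grayscale lip-region images.
audio_in = layers.Input(shape=(26, 13), name="audio")
visual_in = layers.Input(shape=(60, 80, 1), name="visual")

# Unimodal feature extractors.
a = layers.Conv1D(32, 3, activation="relu")(audio_in)        # (24, 32)
v = layers.Conv2D(32, (3, 3), activation="relu")(visual_in)  # (58, 78, 32)
v = layers.MaxPooling2D((2, 2))(v)                           # (29, 39, 32)

# 2D -> 1D cross-connection: flatten the visual feature map and project it
# into a sequence-shaped tensor that can join the audio stream.
v_to_a = layers.Flatten()(v)
v_to_a = layers.Dense(24 * 8, activation="relu")(v_to_a)
v_to_a = layers.Reshape((24, 8))(v_to_a)
a = layers.Concatenate()([a, v_to_a])                        # (24, 40)

# 1D -> 2D cross-connection: project audio features into a small feature map
# and concatenate it channel-wise with the visual stream.
a_to_v = layers.Flatten()(layers.Conv1D(16, 3, activation="relu")(a))
a_to_v = layers.Dense(29 * 39 * 4, activation="relu")(a_to_v)
a_to_v = layers.Reshape((29, 39, 4))(a_to_v)
v = layers.Concatenate()([v, a_to_v])                        # (29, 39, 36)

# Merge both enriched streams for letter classification (26 classes in AVletters).
merged = layers.Concatenate()([layers.Flatten()(a), layers.Flatten()(v)])
out = layers.Dense(26, activation="softmax")(merged)

model = Model(inputs=[audio_in, visual_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```

In this sketch the dense-plus-reshape projections are what make features from one dimensionality usable in the other stream; the baselines without cross-connections would simply omit the two `Concatenate` merges inside the extractors.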
