On the Benefits of Early Fusion in Multimodal Representation Learning

Barnum, George, Talukder, Sabera, Yue, Yisong

Nov-13-2020–arXiv.org Artificial Intelligence

Intelligently reasoning about the world often requires integrating data from multiple modalities, as any individual modality may contain unreliable or incomplete information. Conventional multimodal learning tends to fuse modalities only late in processing, whereas the brain performs multimodal processing almost immediately. This divide between conventional multimodal learning and neuroscience suggests that a detailed study of early multimodal fusion could improve artificial multimodal representations. To facilitate the study of early multimodal fusion, we create a convolutional LSTM (C-LSTM) network architecture that simultaneously processes both audio and visual inputs and allows us to select the layer at which audio and visual information is combined. Our results demonstrate that immediate fusion of audio and visual inputs in the initial C-LSTM layer results in higher-performing networks that are more robust to the addition of white noise in both audio and visual inputs. In many cases, an individual modality does not contain sufficient information to classify the scene.
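The robustness claim above concerns white noise added to the audio and visual inputs. The abstract does not state how that noise was generated, so the following is only a minimal sketch of one common approach: adding zero-mean Gaussian white noise scaled to a target signal-to-noise ratio (SNR) in decibels. The function names `add_white_noise` and `measured_snr_db`, the flat-list signal representation, and the example sine wave are all illustrative assumptions, not the authors' code.

```python
import math
import random


def add_white_noise(signal, snr_db, rng=None):
    """Return a copy of `signal` with zero-mean Gaussian white noise added.

    Hypothetical helper: the noise level is chosen so that the expected
    signal-to-noise ratio equals `snr_db`. Assumes a non-empty signal.
    """
    rng = rng or random.Random()
    # Average power of the clean signal.
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise_std = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, noise_std) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB between a clean signal and its noisy version."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    # Illustrative stand-in for a flattened audio or image input.
    clean = [math.sin(2 * math.pi * 5 * t / 1000) for t in range(10000)]
    noisy = add_white_noise(clean, snr_db=10.0, rng=rng)
    # The measured value should be close to the 10 dB target.
    print(round(measured_snr_db(clean, noisy), 1))
```

Under this assumed setup, sweeping `snr_db` downward and re-evaluating a trained network at each level would give a robustness curve that can be compared across different fusion layers.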

Country:
- North America > United States > California > Los Angeles County > Pasadena (0.04)

Genre:
- Research Report > New Finding (0.86)

Industry:
- Health & Medicine > Therapeutic Area > Neurology (0.67)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
