Learning Multimodal VAEs through Mutual Supervision

Joy, Tom, Shi, Yuge, Torr, Philip H. S., Rainforth, Tom, Schmon, Sebastian M., Siddharth, N.

arXiv.org Artificial Intelligence 

Multimodal variational autoencoders (VAEs) seek to model the joint distribution over heterogeneous data (e.g. vision, language). Prior work has typically combined information from the modalities by reconciling idiosyncratic representations directly in the recognition model through explicit products, mixtures, or other such factorisations. Here we introduce a novel alternative, the Mutually supErvised Multimodal VAE (MEME), that avoids such explicit combinations by repurposing semi-supervised VAEs to combine information between modalities implicitly through mutual supervision. This formulation naturally allows learning from partially-observed data where some modalities can be entirely missing--something that most existing approaches either cannot handle, or do so only to a limited extent.

Modelling the generative process underlying heterogeneous data, particularly data spanning multiple perceptual modalities such as vision or language, can be enormously challenging. Consider, for example, the case where data spans photographs and sketches of objects. Here, a data point, comprising an instance from each modality, is constrained by the fact that the instances are related and must depict the same underlying abstract concept. An effective model not only needs to faithfully generate data in each of the different modalities, it also needs to do so in a manner that preserves the underlying relation between them. Learning a model over multimodal data thus relies on the ability to bring together information from idiosyncratic sources so that it overlaps on the aspects the modalities share, while remaining disjoint otherwise. Variational autoencoders (VAEs) (Kingma & Welling, 2014) are a class of deep generative models that are particularly well-suited to multimodal data, as they employ encoders--learnable mappings from high-dimensional data to lower-dimensional representations--that provide the means to combine information across modalities.
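To make the contrast concrete, the "explicit combination" used in much prior multimodal-VAE work can be sketched as a product-of-experts over per-modality encoders: with diagonal-Gaussian posteriors, precisions add and means are precision-weighted. The sketch below is a minimal illustration under that assumption; the function name and toy values are ours, not code from MEME, which avoids this explicit fusion step altogether.

    import numpy as np

    def product_of_experts(mus, logvars):
        # Fuse per-modality diagonal Gaussian posteriors q(z | x_m) into one
        # Gaussian via a product of experts: precisions sum, and the joint
        # mean is the precision-weighted average of the per-modality means.
        precisions = [np.exp(-lv) for lv in logvars]   # 1 / sigma_m^2, element-wise
        total_precision = sum(precisions)
        joint_var = 1.0 / total_precision
        joint_mu = joint_var * sum(p * m for p, m in zip(precisions, mus))
        return joint_mu, np.log(joint_var)

    # Toy example: two modalities, 4-dimensional latent space.
    mu_image, logvar_image = np.zeros(4), np.zeros(4)              # N(0, 1) per dimension
    mu_text, logvar_text = np.ones(4), np.log(0.5) * np.ones(4)    # N(1, 0.5) per dimension
    mu_joint, logvar_joint = product_of_experts([mu_image, mu_text],
                                                [logvar_image, logvar_text])

A mixture-of-experts variant would instead average the per-modality densities rather than multiply them; MEME, by contrast, dispenses with forming such a joint recognition distribution and lets each modality supervise the other's encoder instead.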
