AITopics | multimodal vae

Collaborating Authors

multimodal vae

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Unity by Diversity: Improved Representation Learning for Multimodal VAEs

Neural Information Processing SystemsMay-27-2025, 07:42:09 GMT

Variational Autoencoders for multimodal data hold promise for many tasks in data analysis, such as representation learning, conditional generation, and imputation.Current architectures either share the encoder output, decoder input, or both across modalities to learn a shared representation. Such architectures impose hard constraints on the model. In this work, we show that a better latent representation can be obtained by replacing these hard constraints with a soft constraint. We propose a new mixture-of-experts prior, softly guiding each modality's latent representation towards a shared aggregate posterior.This approach results in a superior latent representation and allows each encoding to preserve information better from its uncompressed original features. In extensive experiments on multiple benchmark datasets and two challenging real-world datasets, we show improved learned latent representations and imputation of missing data modalities compared to existing methods.

artificial intelligence, constraint-based reasoning, latent representation, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (0.91)

Add feedback

Multimodal Variational Autoencoder: a Barycentric View

Qiu, Peijie, Zhu, Wenhui, Kumar, Sayantan, Chen, Xiwen, Sun, Xiaotong, Yang, Jin, Razi, Abolfazl, Wang, Yalin, Sotiras, Aristeidis

arXiv.org Artificial IntelligenceDec-29-2024

Multiple signal modalities, such as vision and sounds, are naturally present in real-world phenomena. Recently, there has been growing interest in learning generative models, in particular variational autoencoder (VAE), to for multimodal representation learning especially in the case of missing modalities. The primary goal of these models is to learn a modality-invariant and modality-specific representation that characterizes information across multiple modalities. Previous attempts at multimodal VAEs approach this mainly through the lens of experts, aggregating unimodal inference distributions with a product of experts (PoE), a mixture of experts (MoE), or a combination of both. In this paper, we provide an alternative generic and theoretical formulation of multimodal VAE through the lens of barycenter. We first show that PoE and MoE are specific instances of barycenters, derived by minimizing the asymmetric weighted KL divergence to unimodal inference distributions. Our novel formulation extends these two barycenters to a more flexible choice by considering different types of divergences. In particular, we explore the Wasserstein barycenter defined by the 2-Wasserstein distance, which better preserves the geometry of unimodal distributions by capturing both modality-specific and modality-invariant representations compared to KL divergence. Empirical studies on three multimodal benchmarks demonstrated the effectiveness of the proposed method.

artificial intelligence, machine learning, modality, (19 more...)

arXiv.org Artificial Intelligence

2412.20487

Country: North America > United States (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Weakly-Supervised Multimodal Learning on MIMIC-CXR

Agostini, Andrea, Chopard, Daphné, Meng, Yang, Fortin, Norbert, Shahbaba, Babak, Mandt, Stephan, Sutter, Thomas M., Vogt, Julia E.

arXiv.org Artificial IntelligenceNov-15-2024

Multimodal data integration and label scarcity pose significant challenges for machine learning in medical settings. To address these issues, we conduct an in-depth evaluation of the newly proposed Multimodal Variational Mixture-of-Experts (MMVM) VAE on the challenging MIMIC-CXR dataset. Our analysis demonstrates that the MMVM VAE consistently outperforms other multimodal VAEs and fully supervised approaches, highlighting its strong potential for real-world medical applications.

artificial intelligence, machine learning, representation, (17 more...)

arXiv.org Artificial Intelligence

2411.10356

Country:

Europe > Switzerland > Zürich > Zürich (0.04)
North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Diagnostic Medicine > Imaging (0.94)
Health & Medicine > Health Care Technology (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Enhancing Unimodal Latent Representations in Multimodal VAEs through Iterative Amortized Inference

Oshima, Yuta, Suzuki, Masahiro, Matsuo, Yutaka

arXiv.org Artificial IntelligenceOct-15-2024

Multimodal variational autoencoders (VAEs) aim to capture shared latent representations by integrating information from different data modalities. A significant challenge is accurately inferring representations from any subset of modalities without training an impractical number (2^M) of inference networks for all possible modality combinations. Mixture-based models simplify this by requiring only as many inference models as there are modalities, aggregating unimodal inferences. However, they suffer from information loss when modalities are missing. Alignment-based VAEs address this by aligning unimodal inference models with a multimodal model through minimizing the Kullback-Leibler (KL) divergence but face issues due to amortization gaps, which compromise inference accuracy. To tackle these problems, we introduce multimodal iterative amortized inference, an iterative refinement mechanism within the multimodal VAE framework. This method overcomes information loss from missing modalities and minimizes the amortization gap by iteratively refining the multimodal inference using all available modalities. By aligning unimodal inference to this refined multimodal posterior, we achieve unimodal inferences that effectively incorporate multimodal information while requiring only unimodal inputs during inference. Experiments on benchmark datasets show that our approach improves inference performance, evidenced by higher linear classification accuracy and competitive cosine similarity, and enhances cross-modal generation, indicated by lower FID scores. This demonstrates that our method enhances inferred representations from unimodal inputs.

artificial intelligence, inference, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2410.11403

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
North America > United States > California (0.04)

Genre: Research Report > New Finding (0.93)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Revising Multimodal VAEs with Diffusion Decoders

Wesego, Daniel, Rooshenas, Amirmohammad

arXiv.org Artificial IntelligenceAug-29-2024

Multimodal VAEs often struggle with generating high-quality outputs, a challenge that extends beyond the inherent limitations of the VAE framework. The core issue lies in the restricted joint representation of the latent space, particularly when complex modalities like images are involved. Feedforward decoders, commonly used for these intricate modalities, inadvertently constrain the joint latent space, leading to a degradation in the quality of the other modalities as well. Although recent studies have shown improvement by introducing modality-specific representations, the issue remains significant. In this work, we demonstrate that incorporating a flexible diffusion decoder specifically for the image modality not only enhances the generation quality of the images but also positively impacts the performance of the other modalities that rely on feedforward decoders. This approach addresses the limitations imposed by conventional joint representations and opens up new possibilities for improving multimodal generation tasks using the multimodal VAE framework.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2408.16883

Country:

Europe > Austria (0.04)
North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(8 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Bayesian Structural Model Updating with Multimodal Variational Autoencoder

Itoi, Tatsuya, Amishiki, Kazuho, Lee, Sangwon, Yaoyama, Taro

arXiv.org Machine LearningJun-20-2024

A novel framework for Bayesian structural model updating is presented in this study. The proposed method utilizes the surrogate unimodal encoders of a multimodal variational autoencoder (VAE). The method facilitates an approximation of the likelihood when dealing with a small number of observations. It is particularly suitable for high-dimensional correlated simultaneous observations applicable to various dynamic analysis models. The proposed approach was benchmarked using a numerical model of a single-story frame building with acceleration and dynamic strain measurements. Additionally, an example involving a Bayesian update of nonlinear model parameters for a three-degree-of-freedom lumped mass model demonstrates computational efficiency when compared to using the original VAE, while maintaining adequate accuracy for practical applications.

license by-nc-nd 4, manuscript version, model parameter, (14 more...)

arXiv.org Machine Learning

doi: 10.1016/j.cma.2024.117148

2406.09051

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Asia > Japan > Kyūshū & Okinawa > Kyūshū > Kumamoto Prefecture > Kumamoto (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)

Add feedback

Unity by Diversity: Improved Representation Learning in Multimodal VAEs

Sutter, Thomas M., Meng, Yang, Agostini, Andrea, Chopard, Daphné, Fortin, Norbert, Vogt, Julia E., Shahbaba, Bahbak, Mandt, Stephan

arXiv.org Artificial IntelligenceMay-31-2024

Variational Autoencoders for multimodal data hold promise for many tasks in data analysis, such as representation learning, conditional generation, and imputation. Current architectures either share the encoder output, decoder input, or both across modalities to learn a shared representation. Such architectures impose hard constraints on the model. In this work, we show that a better latent representation can be obtained by replacing these hard constraints with a soft constraint. We propose a new mixture-of-experts prior, softly guiding each modality's latent representation towards a shared aggregate posterior. This approach results in a superior latent representation and allows each encoding to preserve information better from its uncompressed original features. In extensive experiments on multiple benchmark datasets and two challenging real-world datasets, we show improved learned latent representations and imputation of missing data modalities compared to existing methods.

latent representation, modality, representation, (15 more...)

arXiv.org Artificial Intelligence

2403.053

Country:

Europe > Switzerland > Zürich > Zürich (0.04)
North America > United States > California (0.04)
North America > Canada > Quebec > Montreal (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Diagnostic Medicine > Imaging (0.68)
Health & Medicine > Therapeutic Area > Neurology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (0.74)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Bridging Language, Vision and Action: Multimodal VAEs in Robotic Manipulation Tasks

Sejnova, Gabriela, Vavrecka, Michal, Stepanova, Karla

arXiv.org Artificial IntelligenceApr-2-2024

In this work, we focus on unsupervised vision-language-action mapping in the area of robotic manipulation. Recently, multiple approaches employing pre-trained large language and vision models have been proposed for this task. However, they are computationally demanding and require careful fine-tuning of the produced outputs. A more lightweight alternative would be the implementation of multimodal Variational Autoencoders (VAEs) which can extract the latent features of the data and integrate them into a joint representation, as has been demonstrated mostly on image-image or image-text data for the state-of-the-art models. Here we explore whether and how can multimodal VAEs be employed in unsupervised robotic manipulation tasks in a simulated environment. Based on the obtained results, we propose a model-invariant training alternative that improves the models' performance in a simulator by up to 55%. Moreover, we systematically evaluate the challenges raised by the individual tasks such as object or robot position variability, number of distractors or the task length. Our work thus also sheds light on the potential benefits and limitations of using the current multimodal VAEs for unsupervised learning of robotic motion trajectories based on vision and language.

modality, multimodal vae, trajectory, (15 more...)

arXiv.org Artificial Intelligence

2404.01932

Country:

Europe > Czechia > Prague (0.04)
North America > Canada > Alberta > Census Division No. 15 > Improvement District No. 9 > Banff (0.04)

Genre: Research Report > Promising Solution (0.48)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.49)

Add feedback

Disentangling shared and private latent factors in multimodal Variational Autoencoders

Märtens, Kaspar, Yau, Christopher

arXiv.org Machine LearningMar-10-2024

Generative models for multimodal data permit the identification of latent factors that may be associated with important determinants of observed data heterogeneity. Common or shared factors could be important for explaining variation across modalities whereas other factors may be private and important only for the explanation of a single modality. Multimodal Variational Autoencoders, such as MVAE and MMVAE, are a natural choice for inferring those underlying latent factors and separating shared variation from private. In this work, we investigate their capability to reliably perform this disentanglement. In particular, we highlight a challenging problem setting where modality-specific variation dominates the shared signal. Taking a cross-modal prediction perspective, we demonstrate limitations of existing models, and propose a modification how to make them more robust to modality-specific variation. Our findings are supported by experiments on synthetic as well as various real-world multi-omics data sets.

latent factor, modality, variation, (14 more...)

arXiv.org Machine Learning

2403.06338

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.48)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Hematology (0.68)
Health & Medicine > Therapeutic Area > Oncology > Leukemia (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Benchmarking Multimodal Variational Autoencoders: CdSprites+ Dataset and Toolkit

Sejnova, Gabriela, Vavrecka, Michal, Stepanova, Karla

arXiv.org Artificial IntelligenceNov-24-2023

Multimodal Variational Autoencoders (VAEs) have been the subject of intense research in the past years as they can integrate multiple modalities into a joint representation and can thus serve as a promising tool for both data classification and generation. Several approaches toward multimodal VAE learning have been proposed so far, their comparison and evaluation have however been rather inconsistent. One reason is that the models differ at the implementation level, another problem is that the datasets commonly used in these cases were not initially designed to evaluate multimodal generative models. This paper addresses both mentioned issues. First, we propose a toolkit for systematic multimodal VAE training and comparison. The toolkit currently comprises 4 existing multimodal VAEs and 6 commonly used benchmark datasets along with instructions on how to easily add a new model or a dataset. Second, we present a disentangled bimodal dataset designed to comprehensively evaluate the joint generation and cross-generation capabilities across multiple difficulty levels. We demonstrate the utility of our dataset by comparing the implemented state-of-the-art models.

dataset, evaluation, modality, (16 more...)

arXiv.org Artificial Intelligence

2209.03048

Country:

Europe > Czechia > Prague (0.04)
North America > Canada > Alberta > Census Division No. 15 > Improvement District No. 9 > Banff (0.04)
Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
Asia (0.04)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback