AITopics | multimodal generative model

Mutual Information Divergence: A Unified Metric for Multimodal Generative Models

Neural Information Processing Systems | Dec-25-2025, 13:26:13 GMT

Text-to-image generation and image captioning have recently emerged as a new experimental paradigm for assessing machine intelligence. They predict continuous quantities accompanied by sampling techniques during generation, which complicates evaluation and makes the marginal distributions intractable to obtain. Following a recent trend in which multimodal generative evaluations exploit a vision-and-language pre-trained model, we propose the negative Gaussian cross-mutual information using the CLIP features as a unified metric, termed Mutual Information Divergence (MID). To validate it, we extensively compare it with competing metrics using carefully generated or human-annotated judgments in text-to-image generation and image captioning tasks. The proposed MID significantly outperforms the competing methods in consistency across benchmarks, sample parsimony, and robustness to the choice of CLIP model. We look forward to seeing the underrepresented implications of the Gaussian cross-mutual information in multimodal representation learning and future work based on this novel proposition.
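The abstract builds on mutual information under a Gaussian assumption. As a minimal sketch of that underlying quantity only, the snippet below computes the closed-form mutual information of two paired feature sets treated as jointly Gaussian. It is not the paper's MID estimator: the cross-mutual-information construction and the CLIP encoder are not reproduced here, the random arrays are stand-ins for real image and text features, and the function name and `eps` regularizer are assumptions made for illustration.

```python
import numpy as np

def gaussian_mutual_information(x: np.ndarray, y: np.ndarray, eps: float = 1e-6) -> float:
    """Mutual information (in nats) of paired features under a joint Gaussian fit.

    x: (n, dx) array, y: (n, dy) array, paired row by row.
    Uses I(X; Y) = 0.5 * (log det Cov_x + log det Cov_y - log det Cov_xy).
    """
    z = np.concatenate([x, y], axis=1)
    dx = x.shape[1]
    # eps on the diagonal keeps the covariance well conditioned (assumed choice).
    cov_z = np.cov(z, rowvar=False) + eps * np.eye(z.shape[1])
    cov_x = cov_z[:dx, :dx]
    cov_y = cov_z[dx:, dx:]
    # slogdet is numerically safer than log(det(...)).
    _, logdet_x = np.linalg.slogdet(cov_x)
    _, logdet_y = np.linalg.slogdet(cov_y)
    _, logdet_z = np.linalg.slogdet(cov_z)
    return 0.5 * (logdet_x + logdet_y - logdet_z)

rng = np.random.default_rng(0)
image_feats = rng.normal(size=(2000, 4))                        # stand-in for image features
aligned_text = image_feats + 0.5 * rng.normal(size=(2000, 4))   # correlated stand-in
unrelated_text = rng.normal(size=(2000, 4))                     # independent stand-in

mi_aligned = gaussian_mutual_information(image_feats, aligned_text)
mi_unrelated = gaussian_mutual_information(image_feats, unrelated_text)
```

With these synthetic inputs, the correlated pair should score well above the independent pair, which is the behaviour an alignment metric based on mutual information relies on.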

multimodal generative model, mutual information divergence, unified metric, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Vision (1.00)

Multimodal Generative Models for Scalable Weakly-Supervised Learning

Neural Information Processing Systems | Nov-20-2025, 21:47:46 GMT

Learning a joint representation of these modalities should yield deeper and more useful representations. Previous generative approaches to multi-modal input either do not learn a joint distribution or require additional computation to handle missing data. Here, we introduce a multimodal variational autoencoder (MVAE) that uses a product-of-experts inference network and a sub-sampled training paradigm to solve the multi-modal inference problem. Notably, our model shares parameters to efficiently learn under any combination of missing modalities. We apply the MVAE on four datasets and match state-of-the-art performance using many fewer parameters. In addition, we show that the MVAE is directly applicable to weakly-supervised learning, and is robust to incomplete supervision. We then consider two case studies, one of learning image transformations (edge detection, colorization, segmentation) as a set of modalities, followed by one of machine translation between two languages. We find appealing results across this range of tasks.
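To make the product-of-experts idea concrete, here is a minimal sketch assuming each expert is a diagonal Gaussian over the latent variable. In that case the product is again Gaussian: precisions add, and the mean is the precision-weighted average. The expert values, the inclusion of a standard normal prior expert, and the helper names are illustrative assumptions, not the paper's code. A missing modality is handled simply by leaving its expert out of the product.

```python
import numpy as np

def product_of_gaussian_experts(means, variances):
    """Combine diagonal Gaussian experts into one diagonal Gaussian.

    means, variances: array-likes of shape (num_experts, latent_dim).
    Precisions add; the mean is the precision-weighted average.
    """
    means = np.asarray(means, dtype=float)
    precisions = 1.0 / np.asarray(variances, dtype=float)
    joint_var = 1.0 / precisions.sum(axis=0)
    joint_mean = joint_var * (means * precisions).sum(axis=0)
    return joint_mean, joint_var

def fuse(*experts):
    """Fuse any subset of (mean, variance) experts; omitted experts are 'missing'."""
    means, variances = zip(*experts)
    return product_of_gaussian_experts(means, variances)

# Hypothetical experts over a 2-D latent space.
prior = (np.zeros(2), np.ones(2))                             # standard normal prior expert (assumed)
image_expert = (np.array([2.0, 0.0]), np.array([0.5, 1.0]))   # stand-in encoder output
text_expert = (np.array([1.0, -1.0]), np.array([1.0, 0.25]))  # stand-in encoder output

both_mean, both_var = fuse(prior, image_expert, text_expert)
image_only_mean, image_only_var = fuse(prior, image_expert)   # text modality missing
```

Because the same function serves every subset of modalities, no separate inference network is needed per combination, which matches the parameter sharing the abstract describes.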

multimodal generative model, name change, scalable weakly-supervised learning, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.82)

Mutual Information Divergence: A Unified Metric for Multimodal Generative Models

Neural Information Processing Systems | Feb-10-2025, 17:35:36 GMT

Text-to-image generation and image captioning have recently emerged as a new experimental paradigm for assessing machine intelligence. They predict continuous quantities accompanied by sampling techniques during generation, which complicates evaluation and makes the marginal distributions intractable to obtain. Following a recent trend in which multimodal generative evaluations exploit a vision-and-language pre-trained model, we propose the negative Gaussian cross-mutual information using the CLIP features as a unified metric, termed Mutual Information Divergence (MID). To validate it, we extensively compare it with competing metrics using carefully generated or human-annotated judgments in text-to-image generation and image captioning tasks. The proposed MID significantly outperforms the competing methods in consistency across benchmarks, sample parsimony, and robustness to the choice of CLIP model.

multimodal generative model, mutual information divergence, unified metric, (3 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Generation (0.40)

Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey

Liu, Xuannan, Cui, Xing, Li, Peipei, Li, Zekun, Huang, Huaibo, Xia, Shuhan, Zhang, Miaoxuan, Zou, Yueying, He, Ran

arXiv.org Artificial Intelligence | Dec-9-2024

The rapid evolution of multimodal foundation models has led to significant advancements in cross-modal understanding and generation across diverse modalities, including text, images, audio, and video. However, these models remain susceptible to jailbreak attacks, which can bypass built-in safety mechanisms and induce the production of potentially harmful content. Consequently, understanding the methods of jailbreak attacks and existing defense mechanisms is essential to ensure the safe deployment of multimodal generative models in real-world scenarios, particularly in security-sensitive applications. To provide comprehensive insight into this topic, this survey reviews jailbreak and defense in multimodal generative models. First, given the generalized lifecycle of multimodal jailbreak, we systematically explore attacks and corresponding defense strategies across four levels: input, encoder, generator, and output. Based on this analysis, we present a detailed taxonomy of attack methods, defense mechanisms, and evaluation frameworks specific to multimodal generative models. Additionally, we cover a wide range of input-output configurations, including modalities such as Any-to-Text, Any-to-Vision, and Any-to-Any within generative systems. Finally, we highlight current research challenges and propose potential directions for future research. The open-source repository corresponding to this work can be found at https://github.com/liuxuannan/Awesome-Multimodal-Jailbreak.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

arXiv: 2411.09259

Country:

Asia > China > Beijing > Beijing (0.04)
North America > United States > California (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Law (0.92)
Government > Military (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Generation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(3 more...)

Reviews: Multimodal Generative Models for Scalable Weakly-Supervised Learning

Neural Information Processing Systems | Oct-7-2024, 05:11:30 GMT

This paper presents a generative approach to multimodal deep learning based on a product-of-experts (PoE) inference network. The main idea is to assume the joint distribution over all modalities factorises into a product of single-modality data-generating distributions when conditioned on the latent space, and use this to derive the structure and factorisation of the variational posterior. The proposed model shares parameters to efficiently handle any combination of missing modalities, and experiments indicate the model's efficacy on various benchmark datasets. The idea is intuitive, the exposition is well-written and easy to follow, and the results are thorough and compelling. I have a few questions / comments, mainly about the relationship of this work with respect to previous approaches ([15] and [21] in the text).
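The factorisation described in this review can be written out as a short derivation. This is a sketch under the stated conditional-independence assumption only; the symbols (latent variable z, modalities x_1 to x_N) are chosen here for illustration, and the steps use nothing beyond Bayes' rule.

```latex
\begin{align*}
p(z \mid x_1, \dots, x_N)
  &= \frac{p(z) \prod_{i=1}^{N} p(x_i \mid z)}{p(x_1, \dots, x_N)} \\
  &= \frac{p(z)}{p(x_1, \dots, x_N)} \prod_{i=1}^{N} \frac{p(z \mid x_i)\, p(x_i)}{p(z)} \\
  &\propto \frac{\prod_{i=1}^{N} p(z \mid x_i)}{p(z)^{N-1}}
\end{align*}
```

Under that assumption the exact joint posterior is, up to normalisation, a product of single-modality posteriors divided by powers of the prior, which is what suggests a product-of-experts structure for the variational posterior.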

modality, multimodal generative model, scalable weakly-supervised learning, (11 more...)

Neural Information Processing Systems

Genre: Summary/Review (0.36)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Revision Matters: Generative Design Guided by Revision Edits

Li, Tao, Cheng, Chin-Yi, Xie, Amber, Li, Gang, Li, Yang

arXiv.org Artificial Intelligence | May-27-2024

Layout design, such as user interface or graphical layout in general, is fundamentally an iterative revision process. Through revising a design repeatedly, the designer converges on an ideal layout. In this paper, we investigate how revision edits from human designers can benefit a multimodal generative model. To do so, we curate an expert dataset that traces how human designers iteratively edit and improve a layout generation with a prompted language goal. Based on such data, we explore various supervised fine-tuning task setups on top of a Gemini multimodal backbone, a large multimodal model. Our results show that human revision plays a critical role in iterative layout refinement. Although noisy, expert revision edits lead our model to a surprisingly strong design FID score of ~10, which is close to human performance (~6). In contrast, self-revisions that rely fully on the model's own judgement lead to an echo chamber that prevents iterative improvement and sometimes leads to generative degradation. Fortunately, we found that providing human guidance at an early stage plays a critical role in the final generation. In such a human-in-the-loop scenario, our work paves the way for iterative design revision based on pre-trained large multimodal models.

layout, revision, revision edit, (16 more...)

arXiv.org Artificial Intelligence

arXiv: 2406.18559

Country:

North America > United States > California > Santa Clara County > Mountain View (0.05)
North America > United States > California > Santa Clara County > Stanford (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)

Genre: Research Report > New Finding (0.54)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

A survey of multimodal deep generative models

Suzuki, Masahiro, Matsuo, Yutaka

arXiv.org Machine Learning | Jul-5-2022

Multimodal learning is a framework for building models that make predictions based on different types of modalities. Important challenges in multimodal learning are the inference of shared representations from arbitrary modalities and cross-modal generation via these representations; however, achieving this requires taking the heterogeneous nature of multimodal data into account. In recent years, deep generative models, i.e., generative models in which distributions are parameterized by deep neural networks, have attracted much attention, especially variational autoencoders, which are suitable for accomplishing the above challenges because they can consider heterogeneity and infer good representations of data. Therefore, various multimodal generative models based on variational autoencoders, called multimodal deep generative models, have been proposed in recent years. In this paper, we provide a categorized survey of studies on multimodal deep generative models.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Machine Learning

doi: 10.1080/01691864.2022.2035253

arXiv: 2207.02127

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Asia > Middle East > Jordan (0.04)
North America > United States > California (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Overview (1.00)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (1.00)

Learning more expressive joint distributions in multimodal variational methods

Nedelkoski, Sasho, Bogojeski, Mihail, Kao, Odej

arXiv.org Artificial Intelligence | Sep-8-2020

Data are often composed of multiple modalities that jointly describe the observed phenomena. Modeling the joint distribution of multimodal data requires larger expressive power to capture high-level concepts and provide better data representations. However, multimodal generative models based on variational inference are limited by the lack of flexibility of the approximate posterior, which is obtained by searching within a known parametric family of distributions. We introduce a method that improves the representational capacity of multimodal variational methods using normalizing flows. It approximates the joint posterior with a simple parametric distribution and subsequently transforms it into a more complex one. Through several experiments, we demonstrate that the model improves on state-of-the-art multimodal methods based on variational inference on various computer vision tasks such as colorization, edge and mask detection, and weakly supervised learning. We also show that learning more powerful approximate joint distributions improves the quality of the generated samples.
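The step of transforming a simple posterior into a more complex one can be illustrated with a single flow layer. The abstract does not say which flow family is used, so the planar flow below is an assumed choice, and the parameter values are hypothetical. The sketch draws samples from a standard normal base distribution, applies the flow, and updates the log-density with the change-of-variables formula.

```python
import numpy as np

def planar_flow(z, u, w, b):
    """Apply one planar flow f(z) = z + u * tanh(w.z + b) to a batch.

    Returns the transformed samples and log|det df/dz| for each sample.
    """
    activation = np.tanh(z @ w + b)            # shape (n,)
    f_z = z + np.outer(activation, u)          # shape (n, d)
    psi = np.outer(1.0 - activation**2, w)     # gradient of tanh(w.z + b) w.r.t. z, shape (n, d)
    log_det = np.log(np.abs(1.0 + psi @ u))    # shape (n,)
    return f_z, log_det

def standard_normal_log_prob(z):
    """Log-density of a standard normal, evaluated row by row."""
    return -0.5 * (z**2).sum(axis=1) - 0.5 * z.shape[1] * np.log(2.0 * np.pi)

rng = np.random.default_rng(0)
z0 = rng.normal(size=(5, 2))   # samples from the simple base posterior
u = np.array([0.8, -0.3])      # hypothetical parameters; w.u > -1 keeps the map invertible
w = np.array([1.0, 0.5])
b = 0.1

z1, log_det = planar_flow(z0, u, w, b)
# Change of variables: density of the transformed samples.
log_q1 = standard_normal_log_prob(z0) - log_det
```

Stacking several such layers and summing their log-determinants yields a progressively more flexible density while keeping the log-probability tractable.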

artificial intelligence, deep learning, machine learning, (15 more...)

arXiv.org Artificial Intelligence

arXiv: 2009.03651

Country:

Europe > Germany > Berlin (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report (0.64)
Instructional Material (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

MHVAE: a Human-Inspired Deep Hierarchical Generative Model for Multimodal Representation Learning

Vasco, Miguel, Melo, Francisco S., Paiva, Ana

arXiv.org Machine Learning | Jun-4-2020

Humans are able to create rich representations of their external reality. Their internal representations allow for cross-modality inference, where available perceptions can induce the perceptual experience of missing input modalities. In this paper, we contribute the Multimodal Hierarchical Variational Auto-encoder (MHVAE), a hierarchical multimodal generative model for representation learning. Inspired by human cognitive models, the MHVAE is able to learn modality-specific distributions for an arbitrary number of modalities, together with a joint-modality distribution responsible for cross-modality inference. We formally derive the model's evidence lower bound and propose a novel methodology to approximate the joint-modality posterior based on modality-specific representation dropout. We evaluate the MHVAE on standard multimodal datasets. Our model performs on par with other state-of-the-art generative models regarding joint-modality reconstruction from arbitrary input modalities and cross-modality inference.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Machine Learning

arXiv: 2006.02991

Country: Europe > Portugal > Lisbon > Lisbon (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Robots (0.94)
Information Technology > Artificial Intelligence > Natural Language > Generation (0.85)

Multimodal Generative Models for Scalable Weakly-Supervised Learning

Wu, Mike, Goodman, Noah

Neural Information Processing Systems | Feb-14-2020, 17:11:08 GMT

Learning a joint representation of these modalities should yield deeper and more useful representations. Previous generative approaches to multi-modal input either do not learn a joint distribution or require additional computation to handle missing data. Here, we introduce a multimodal variational autoencoder (MVAE) that uses a product-of-experts inference network and a sub-sampled training paradigm to solve the multi-modal inference problem. Notably, our model shares parameters to efficiently learn under any combination of missing modalities. We apply the MVAE on four datasets and match state-of-the-art performance using many fewer parameters. In addition, we show that the MVAE is directly applicable to weakly-supervised learning, and is robust to incomplete supervision.

modality, multimodal generative model, scalable weakly-supervised learning, (2 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.65)
Information Technology > Artificial Intelligence > Natural Language > Generation (0.40)
