Unlocking the Capabilities of Masked Generative Models for Image Synthesis via Self-Guidance

Neural Information Processing Systems

Masked generative models (MGMs) have shown impressive generative ability while requiring an order of magnitude fewer sampling steps than continuous diffusion models. However, MGMs still underperform recent well-developed continuous diffusion models of similar size in image synthesis, in terms of both the quality and the diversity of generated samples. A key factor in the performance of continuous diffusion models is their guidance methods, which enhance sample quality at the expense of diversity. In this paper, we extend these guidance methods to a generalized guidance formulation for MGMs and propose a self-guidance sampling method that leads to better generation quality. The proposed approach leverages an auxiliary task for semantic smoothing in vector-quantized token space, analogous to Gaussian blur in continuous pixel space. Equipped with a parameter-efficient fine-tuning method and high-temperature sampling, MGMs with the proposed self-guidance achieve a superior quality-diversity trade-off, outperforming existing MGM sampling methods at lower training and sampling cost. Extensive experiments across various sampling hyperparameters confirm the effectiveness of the proposed self-guidance.
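To make the guidance rule concrete, here is a minimal sketch of how a self-guidance step of this form might combine the base model's token logits with those of an auxiliary smoothing branch. The function names and the exact combination rule are illustrative assumptions modeled on classifier-free guidance, not taken from the paper:

```python
import torch

def self_guided_logits(logits_main, logits_smooth, w):
    """Combine the base MGM's token logits with logits from an auxiliary
    'semantically smoothed' prediction, in the spirit of classifier-free
    guidance: the smooth branch plays the role of the blurred signal,
    and w > 0 pushes samples away from it.

    logits_main:   (batch, seq, vocab) logits from the base MGM
    logits_smooth: (batch, seq, vocab) logits from the smoothing task
    """
    return logits_smooth + w * (logits_main - logits_smooth)

def sample_step(logits, temperature=1.0):
    """One high-temperature categorical sampling step over masked positions."""
    probs = torch.softmax(logits / temperature, dim=-1)
    flat = probs.reshape(-1, probs.shape[-1])
    tokens = torch.multinomial(flat, num_samples=1)
    return tokens.reshape(probs.shape[:-1])
```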


Towards a Scalable Reference-Free Evaluation of Generative Models

Neural Information Processing Systems

While standard evaluation scores for generative models are mostly reference-based, a reference-dependent assessment can be difficult in general due to the unavailability of applicable reference datasets. Recently, the reference-free entropy scores VENDI [1] and RKE [2] have been proposed to evaluate the diversity of generated data. However, estimating these scores from data incurs significant computational costs for large-scale generative models. In this work, we leverage the random Fourier features framework to reduce this computational cost and propose the Fourier-based Kernel Entropy Approximation (FKEA) method. We utilize FKEA's approximated eigenspectrum of the kernel matrix to efficiently estimate the mentioned entropy scores. Furthermore, we show how FKEA's proxy eigenvectors reveal the modes the method identifies when evaluating the diversity of produced samples. We provide a stochastic implementation of the FKEA assessment algorithm with complexity O(n), growing linearly with the sample size n. We extensively evaluate FKEA's numerical performance on standard image, text, and video datasets. Our empirical results indicate the method's scalability and interpretability when applied to large-scale generative models.
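As an illustration of the underlying idea, the following sketch estimates kernel entropy scores from random Fourier features, so the eigendecomposition runs on an r x r feature covariance instead of an n x n kernel matrix. The exact FKEA estimator may differ; all names and defaults here are illustrative:

```python
import numpy as np

def rff_features(X, num_features=2000, sigma=1.0, seed=0):
    """Random Fourier features approximating a Gaussian kernel
    k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) (Rahimi & Recht)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, num_features))
    b = rng.uniform(0, 2 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

def fourier_entropy_score(X, alpha=1.0, **kw):
    """Entropy-based diversity score from the eigenvalues of the kernel
    matrix, estimated via the feature covariance: the nonzero eigenvalues
    of K/n match those of C, so the cost is linear in the sample size n."""
    Phi = rff_features(X, **kw)            # (n, r)
    C = Phi.T @ Phi / Phi.shape[0]         # (r, r) proxy for K/n
    lam = np.clip(np.linalg.eigvalsh(C), 0.0, None)
    lam = lam / lam.sum()
    lam = lam[lam > 1e-12]
    if alpha == 1.0:                       # Shannon entropy, VENDI-style score
        return np.exp(-(lam * np.log(lam)).sum())
    # Renyi-entropy-based diversity (e.g., alpha = 2 for an RKE-style score)
    return (lam ** alpha).sum() ** (1.0 / (1.0 - alpha))
```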


Emergence of Object Segmentation in Perturbed Generative Models

Neural Information Processing Systems

We introduce a framework to learn object segmentation from a collection of images without any manual annotation. We build on the observation that the location of object segments can be perturbed locally relative to a given background without affecting the realism of a scene. First, we train a generative model of a layered scene. The layered representation consists of a background image, a foreground image and the mask of the foreground. A composite image is then obtained by overlaying the masked foreground image onto the background. The generative model is trained in an adversarial fashion against a discriminator, which forces the generative model to produce realistic composite images. To force the generator to learn a representation where the foreground layer corresponds to an object, we perturb the output of the generative model by introducing a random shift of both the foreground image and mask relative to the background. Because the generator is unaware of the shift before computing its output, it must produce layered representations that are realistic for any such random perturbation. Second, we learn to segment an image by defining an autoencoder consisting of an encoder, which we train, and the pretrained generator as the decoder, which we fix.
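A minimal sketch of the compositing-with-random-shift step described above. The tensor layout is assumed, and the wrap-around `torch.roll` translation stands in for whatever boundary handling the paper actually uses:

```python
import torch

def composite_with_random_shift(bg, fg, mask, max_shift=8):
    """Overlay a masked foreground onto a background after a random
    translation of the foreground layer (image and mask move together,
    so the generator cannot anticipate the shift).
    bg, fg: (B, C, H, W); mask: (B, 1, H, W) with values in [0, 1]."""
    dx = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    dy = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    fg = torch.roll(fg, shifts=(dy, dx), dims=(2, 3))      # wraps at borders
    mask = torch.roll(mask, shifts=(dy, dx), dims=(2, 3))
    return mask * fg + (1.0 - mask) * bg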


Almost Horizon-Free Structure-Aware Best Policy Identification with a Generative Model

Neural Information Processing Systems

This paper focuses on the problem of computing an ε-optimal policy in a discounted Markov Decision Process (MDP), provided that we can access the reward and transition function through a generative model. We propose an algorithm that is initially agnostic to the MDP but that can leverage the specific MDP structure, expressed in terms of the variances of the rewards and of the next-state value function, and the gaps in the optimal action-value function, to reduce the sample complexity needed to find a good policy, precisely highlighting the contribution of each state-action pair to the final sample complexity. A key feature of our analysis is that it removes all horizon dependencies in the sample complexity of suboptimal actions except for the intrinsic scaling of the value function and a constant additive term.
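For readers unfamiliar with the access model, the generative-model primitive such analyses build on can be sketched as a Monte-Carlo backup. This is not the paper's algorithm, only the sampling oracle it assumes, with hypothetical helper names:

```python
import numpy as np

def sampled_q_update(sample_next_state, reward, V, s, a, gamma, m):
    """Monte-Carlo backup for Q(s, a) using m calls to a generative model:
    each call returns one next state s' ~ P(. | s, a).
    `sample_next_state`, `reward`, and the value table `V` (a mapping from
    states to value estimates) are hypothetical interfaces."""
    next_states = [sample_next_state(s, a) for _ in range(m)]
    return reward(s, a) + gamma * np.mean([V[sp] for sp in next_states])
```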


A novel generative model, (SI-)VGRNN, is proposed to achieve more interpretable latent representations

Neural Information Processing Systems

We truly appreciate the helpful comments from all three reviewers. R1 asked how (SI-)VGRNN deals with deletions and additions of nodes; note that our model is not designed to predict the occurrence of new nodes. R4 also asked about reconstructing node attributes. A more comprehensive related work section will be added.


Few-Shot Task Learning through Inverse Generative Modeling

Neural Information Processing Systems

Learning the intents of an agent, defined by its goals or motion style, is often extremely challenging from just a few examples. We refer to this problem as task concept learning and present our approach, Few-Shot Task Learning through Inverse Generative Modeling (FTL-IGM), which learns new task concepts by leveraging invertible neural generative models. The core idea is to pretrain a generative model on a set of basic concepts and their demonstrations. Then, given a few demonstrations of a new concept (such as a new goal or a new action), our method learns the underlying concepts through backpropagation without updating the model weights, thanks to the invertibility of the generative model. We evaluate our method in five domains: object rearrangement, goal-oriented navigation, motion capture of human actions, autonomous driving, and real-world table-top manipulation. Our experimental results demonstrate that, via the pretrained generative model, we successfully learn novel concepts and generate agent plans or motion corresponding to these concepts (1) in unseen environments and (2) in composition with training concepts.
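A minimal sketch of concept learning by inversion through a frozen generative model, as the abstract describes: only a concept embedding is optimized, never the model weights. The model interface and loss are illustrative assumptions:

```python
import torch

def invert_concept(frozen_model, demos, concept_dim=64, steps=500, lr=1e-2):
    """Learn a new task-concept embedding by backpropagating a
    reconstruction loss through a frozen, pretrained generative model.
    `frozen_model(concept)` is assumed to return a generated trajectory
    comparable to each demo (hypothetical interface)."""
    concept = torch.zeros(concept_dim, requires_grad=True)
    opt = torch.optim.Adam([concept], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Mean-squared reconstruction of the few demonstrations
        loss = sum(((frozen_model(concept) - d) ** 2).mean() for d in demos)
        loss.backward()   # gradients flow through the frozen model
        opt.step()
    return concept.detach()
```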


Transcendence: Generative Models Can Outperform The Experts That Train Them

Neural Information Processing Systems

Generative models are trained with the simple objective of imitating the conditional probability distribution induced by the data they are trained on. Therefore, when trained on data generated by humans, we may not expect the artificial model to outperform the humans on their original objectives. In this work, we study the phenomenon of transcendence: when a generative model achieves capabilities that surpass the abilities of the experts generating its data. We demonstrate transcendence by training an autoregressive transformer to play chess from game transcripts, and show that the trained model can sometimes achieve better performance than all players in the dataset.
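The training objective described here is plain next-token imitation of human games. A minimal sketch, with a hypothetical model and tokenization interface:

```python
import torch
import torch.nn.functional as F

def imitation_loss(model, transcripts):
    """Standard next-token cross-entropy on tokenized game transcripts:
    the model is trained purely to imitate the conditional distribution
    of human moves. `transcripts` is a (B, T) LongTensor of move tokens;
    `model` returns (B, T-1, vocab) logits (hypothetical interface)."""
    logits = model(transcripts[:, :-1])
    targets = transcripts[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```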


A Prior of a Googol Gaussians: a Tensor Ring Induced Prior for Generative Models

Neural Information Processing Systems

Generative models produce realistic objects in many domains, including text, image, video, and audio synthesis. The most popular models, Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), usually employ a standard Gaussian distribution as a prior. Previous works show that a richer family of prior distributions may help to avoid mode collapse in GANs and to improve the evidence lower bound in VAEs. We propose a new family of prior distributions, the Tensor Ring Induced Prior (TRIP), which packs an exponential number of Gaussians into a high-dimensional lattice with a relatively small number of parameters. We show that these priors improve the Fréchet Inception Distance for GANs and the evidence lower bound for VAEs. We also study generative models with TRIP in the conditional generation setup with missing conditions. Altogether, we propose a novel plug-and-play framework for generative models that can be used with any GAN- or VAE-like architecture.
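A sketch of the tensor ring bookkeeping that lets a small set of cores index an exponentially large lattice of Gaussians. The actual TRIP parameterization (e.g., how nonnegativity and normalization of the mixture weights are enforced) may differ:

```python
import numpy as np

def tensor_ring_weight(cores, index):
    """Unnormalized mixture weight of one Gaussian in the lattice.
    cores: list of d arrays, cores[k] of shape (r, m_k, r); a lattice of
    prod(m_k) Gaussians is parameterized by only O(d * m * r^2) numbers.
    index: tuple (i_1, ..., i_d) selecting one lattice point."""
    r = cores[0].shape[0]
    M = np.eye(r)
    for core, i in zip(cores, index):
        M = M @ core[:, i, :]   # contract the ring, one core per dimension
    return np.trace(M)
```

For example, with d = 20 cores of shape (4, 10, 4), this indexes 10^20 mixture components using only 3,200 parameters.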


Defending Against Neural Fake News

Neural Information Processing Systems

Recent progress in natural language generation has raised dual-use concerns. While applications like summarization and translation are positive, the underlying technology also might enable adversaries to generate neural fake news: targeted propaganda that closely mimics the style of real news. Modern computer security relies on careful threat modeling: identifying potential threats and vulnerabilities from an adversary's point of view, and exploring potential mitigations to these threats. Likewise, developing robust defenses against neural fake news requires us first to carefully investigate and characterize the risks of these models. We thus present a model for controllable text generation called Grover.


Multimodal Large Language Models Make Text-to-Image Generative Models Align Better

Neural Information Processing Systems

Recent studies have demonstrated the exceptional potential of leveraging human preference datasets to refine text-to-image generative models so that they generate more human-preferred images. Despite these advances, current human preference datasets are either prohibitively expensive to construct or lack diversity in preference dimensions, which limits their applicability for instruction tuning of open-source text-to-image generative models and hinders further exploration. To address these challenges, we first leverage multimodal large language models to create VisionPrefer, a fine-grained preference dataset that captures multiple preference aspects (prompt-following, aesthetic, fidelity, and harmlessness). We then train a corresponding reward model, VP-Score, on VisionPrefer to guide the tuning of text-to-image generative models. The preference prediction accuracy of VP-Score is validated to be comparable to that of human annotators.
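One hedged recipe for using such a reward model during tuning is to reweight the per-example generative loss by the predicted preference, so preferred (prompt, image) pairs contribute more. This is an illustrative sketch, not the paper's training procedure, and the interfaces are assumptions:

```python
import torch

def reward_weighted_loss(diffusion_loss, images, prompts, reward_model, beta=1.0):
    """Reweight a per-example generative loss by a softmaxed preference
    reward. `reward_model(images, prompts)` -> (B,) scores and
    `diffusion_loss(images, prompts)` -> (B,) losses are hypothetical
    interfaces standing in for VP-Score and the base model's objective."""
    with torch.no_grad():
        w = torch.softmax(beta * reward_model(images, prompts), dim=0)
    per_example = diffusion_loss(images, prompts)
    return (w * per_example).sum()
```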