Collaborating Authors

 Mallya, Arun


Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models

arXiv.org Artificial Intelligence

We introduce Edify Image, a family of diffusion models capable of generating photorealistic image content with pixel-perfect accuracy. Edify Image utilizes cascaded pixel-space diffusion models trained using a novel Laplacian diffusion process, in which image signals at different frequency bands are attenuated at varying rates. Edify Image supports a wide range of applications, including text-to-image synthesis, 4K upsampling, ControlNets, 360 HDR panorama generation, and finetuning for image customization.
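The frequency-dependent attenuation is the core idea of the Laplacian diffusion process. Below is a minimal NumPy sketch of that idea, assuming a simple average-pool Laplacian pyramid and exponential per-band decay rates (`band_decay`); these are illustrative choices, not the Edify Image formulation:

```python
import numpy as np

def build_laplacian_pyramid(img, levels=3):
    # Decompose an (H, W) image into frequency bands: each level keeps the detail
    # lost by 2x average-pool downsampling; the last entry is the low-pass residual.
    bands, cur = [], img
    for _ in range(levels):
        h, w = cur.shape
        low = cur.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))  # 2x downsample
        up = np.repeat(np.repeat(low, 2, axis=0), 2, axis=1)       # 2x upsample
        bands.append(cur - up)                                      # high-frequency detail
        cur = low
    bands.append(cur)                                               # coarsest band
    return bands

def reconstruct(bands):
    cur = bands[-1]
    for detail in reversed(bands[:-1]):
        cur = np.repeat(np.repeat(cur, 2, axis=0), 2, axis=1) + detail
    return cur

def laplacian_forward_diffusion(img, t, band_decay=(4.0, 2.0, 1.0, 0.5), rng=None):
    # Hypothetical forward process: each band is attenuated at its own rate before
    # Gaussian noise is mixed in, so high frequencies disappear earlier in time t.
    rng = np.random.default_rng() if rng is None else rng
    bands = build_laplacian_pyramid(img, levels=len(band_decay) - 1)
    noisy = []
    for band, decay in zip(bands, band_decay):
        alpha = np.exp(-decay * t)                    # band-specific attenuation, t in [0, 1]
        noise = rng.standard_normal(band.shape)
        noisy.append(alpha * band + np.sqrt(1.0 - alpha**2) * noise)
    return reconstruct(noisy)
```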


Movie Gen: A Cast of Media Foundation Models

arXiv.org Artificial Intelligence

We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We also show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user's image. Our models set a new state-of-the-art on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation. Our largest video generation model is a 30B parameter transformer trained with a maximum context length of 73K video tokens, corresponding to a generated video of 16 seconds at 16 frames-per-second. We show multiple technical innovations and simplifications on the architecture, latent spaces, training objectives and recipes, data curation, evaluation protocols, parallelization techniques, and inference optimizations that allow us to reap the benefits of scaling pre-training data, model size, and training compute for training large scale media generation models. We hope this paper helps the research community to accelerate progress and innovation in media generation models. All videos from this paper are available at https://go.fb.me/MovieGenResearchVideos.
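As a quick sanity check on the reported numbers, a 16-second clip at 16 frames per second spans 256 frames, so the 73K-token context corresponds to roughly 285 tokens per frame if tokens were split evenly across frames (the paper's actual tokenization may allocate them differently):

```python
# Back-of-the-envelope relation between clip length and context size
# (assumes an even token split across frames, an illustrative simplification).
seconds, fps = 16, 16
frames = seconds * fps                   # 256 frames in the longest generated clip
context_tokens = 73_000                  # maximum training context length
tokens_per_frame = context_tokens / frames
print(frames, round(tokens_per_frame))   # -> 256 frames, ~285 tokens per frame
```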


Dreaming to Distill: Data-free Knowledge Transfer via DeepInversion

arXiv.org Machine Learning

We introduce DeepInversion, a new method for synthesizing images from the image distribution used to train a deep neural network. We 'invert' a trained network (teacher) to synthesize class-conditional input images starting from random noise, without using any additional information about the training dataset. Keeping the teacher fixed, our method optimizes the input while regularizing the distribution of intermediate feature maps using information stored in the batch normalization layers of the teacher. Further, we improve the diversity of synthesized images using Adaptive DeepInversion, which maximizes the Jensen-Shannon divergence between the teacher and student network logits. The resulting synthesized images from networks trained on the CIFAR-10 and ImageNet datasets demonstrate high fidelity and degree of realism, and help enable a new breed of data-free applications - ones that do not require any real images or labeled data. We demonstrate the applicability of our proposed method to three tasks of immense practical importance -- (i) data-free network pruning, (ii) data-free knowledge transfer, and (iii) data-free continual learning.
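To make the two losses above concrete, here is a rough PyTorch-style sketch (not the released NVlabs code) of one DeepInversion optimization step: the batch statistics of the synthesized images are pulled toward the teacher's stored BatchNorm running statistics, and the optional Adaptive DeepInversion term subtracts the teacher/student Jensen-Shannon divergence so that images on which the two networks disagree are favored. Function and argument names are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BNStatsHook:
    # Forward hook: measures how far the synthesized batch's statistics drift
    # from the running mean/variance stored in one teacher BatchNorm layer.
    def __init__(self):
        self.loss = 0.0
    def __call__(self, module, inputs, output):
        x = inputs[0]
        mean = x.mean(dim=(0, 2, 3))
        var = x.var(dim=(0, 2, 3), unbiased=False)
        self.loss = (F.mse_loss(mean, module.running_mean)
                     + F.mse_loss(var, module.running_var))

def deep_inversion_loss(teacher, images, targets, student=None, js_weight=1.0):
    # Loss for one step of optimizing the synthesized batch `images`
    # (a leaf tensor with requires_grad=True); the teacher stays frozen.
    hooks, handles = [], []
    for m in teacher.modules():
        if isinstance(m, nn.BatchNorm2d):
            h = BNStatsHook()
            hooks.append(h)
            handles.append(m.register_forward_hook(h))
    logits_t = teacher(images)
    loss = F.cross_entropy(logits_t, targets) + sum(h.loss for h in hooks)
    if student is not None:
        # Adaptive DeepInversion: maximize teacher/student Jensen-Shannon divergence.
        p = F.softmax(logits_t, dim=1)
        q = F.softmax(student(images), dim=1)
        m = 0.5 * (p + q)
        js = 0.5 * (F.kl_div(m.log(), p, reduction='batchmean')
                    + F.kl_div(m.log(), q, reduction='batchmean'))
        loss = loss - js_weight * js
    for handle in handles:
        handle.remove()
    return loss
```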


Importance Estimation for Neural Network Pruning

arXiv.org Machine Learning

Structural pruning of neural network parameters reduces computation, energy, and memory transfer costs during inference. We propose a novel method that estimates the contribution of a neuron (filter) to the final loss and iteratively removes those with smaller scores. We describe two variations of our method using the first- and second-order Taylor expansions to approximate a filter's contribution. Both methods scale consistently across any network layer without requiring per-layer sensitivity analysis and can be applied to any kind of layer, including skip connections. For modern networks trained on ImageNet, we experimentally measured a high (>93%) correlation between the contribution computed by our methods and a reliable estimate of the true importance. Pruning with the proposed methods leads to an improvement over the state of the art in terms of accuracy, FLOPs, and parameter reduction. On ResNet-101, we achieve a 40% FLOPs reduction by removing 30% of the parameters, with a loss of 0.02% in top-1 accuracy on ImageNet. Code is available at https://github.com/NVlabs/Taylor_pruning.
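The first-order variant of the importance score can be sketched in a few lines of PyTorch (illustrative, not the code at the repository above): for each convolutional filter, the product of its weights and their gradients is summed and squared, approximating the change in loss if that filter were removed:

```python
import torch
import torch.nn as nn

def taylor_importance(model, data_loader, criterion, device="cpu"):
    # First-order Taylor importance sketch: accumulate, per conv filter,
    # ((gradient * weight) summed over the filter)^2; small scores mark filters
    # whose removal is expected to barely change the loss.
    scores = {name: torch.zeros(m.out_channels)
              for name, m in model.named_modules() if isinstance(m, nn.Conv2d)}
    model.to(device).train()
    for inputs, targets in data_loader:
        model.zero_grad()
        loss = criterion(model(inputs.to(device)), targets.to(device))
        loss.backward()
        for name, m in model.named_modules():
            if isinstance(m, nn.Conv2d):
                contrib = (m.weight.grad * m.weight.data).sum(dim=(1, 2, 3))
                scores[name] += contrib.pow(2).cpu()
    return scores  # prune the filters with the smallest accumulated scores
```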


Few-Shot Unsupervised Image-to-Image Translation

arXiv.org Machine Learning

Unsupervised image-to-image translation methods learn to map images in a given class to an analogous image in a different class, drawing on unstructured (non-registered) datasets of images. While remarkably successful, current methods require access to many images in both source and destination classes at training time. We argue this greatly limits their use. Drawing inspiration from the human capability of picking up the essence of a novel object from a small number of examples and generalizing from there, we seek a few-shot, unsupervised image-to-image translation algorithm that works on previously unseen target classes that are specified, at test time, only by a few example images. Our model achieves this few-shot generation capability by coupling an adversarial training scheme with a novel network design.

The Few-shot Unsupervised Image-to-Image Translation (FUNIT) framework aims at learning an image-to-image translation model for mapping an image of a source class to an analogous image of a target class by leveraging a few images of the target class given at test time. The model is never shown images of the target class during training but is asked to generate some of them at test time. To proceed, we first hypothesize that the few-shot generation capability of humans develops from their past visual experiences: a person can better imagine views of a new object if the person has seen many more different object classes in the past. Based on this hypothesis, we train our FUNIT model using a dataset containing images of many different object classes to simulate these past visual experiences. Specifically, we train the model to translate images from one class to another by leveraging a few example images of that other class.
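A hedged sketch of what few-shot inference looks like under this formulation, with placeholder module names (`content_encoder`, `class_encoder`, `decoder`) rather than the released FUNIT API: the structure of the source image is kept, while the class appearance is taken from the mean embedding of the few target-class example images:

```python
import torch

def few_shot_translate(content_encoder, class_encoder, decoder, content_img, target_imgs):
    # Translate `content_img` into the class defined by a handful of `target_imgs`.
    # Modules and the averaging of class codes are illustrative assumptions.
    with torch.no_grad():
        content_code = content_encoder(content_img.unsqueeze(0))   # what to draw (structure)
        class_codes = class_encoder(torch.stack(target_imgs))      # K target-class examples
        class_code = class_codes.mean(dim=0, keepdim=True)         # average class appearance
        return decoder(content_code, class_code)                   # translated image
```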