AITopics | Huang, Xun

Collaborating Authors

Huang, Xun

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

DiffCollage: Parallel Generation of Large Content with Diffusion Models

Zhang, Qinsheng, Song, Jiaming, Huang, Xun, Chen, Yongxin, Liu, Ming-Yu

arXiv.org Artificial IntelligenceMar-29-2023

We present DiffCollage, a compositional diffusion model that can generate large content by leveraging diffusion models trained on generating pieces of the large content. Our approach is based on a factor graph representation where each factor node represents a portion of the content and a variable node represents their overlap. This representation allows us to aggregate intermediate outputs from diffusion models defined on individual nodes to generate content of arbitrary size and shape in parallel without resorting to an autoregressive generation procedure. We apply DiffCollage to various tasks, including infinite image generation, panorama image generation, and long-duration text-guided motion generation. Extensive experimental results with a comparison to strong autoregressive baselines verify the effectiveness of our approach.

diffusion model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2303.17076

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Magic3D: High-Resolution Text-to-3D Content Creation

Lin, Chen-Hsuan, Gao, Jun, Tang, Luming, Takikawa, Towaki, Zeng, Xiaohui, Huang, Xun, Kreis, Karsten, Fidler, Sanja, Liu, Ming-Yu, Lin, Tsung-Yi

arXiv.org Artificial IntelligenceMar-25-2023

DreamFusion has recently demonstrated the utility of a pre-trained text-to-image diffusion model to optimize Neural Radiance Fields (NeRF), achieving remarkable text-to-3D synthesis results. However, the method has two inherent limitations: (a) extremely slow optimization of NeRF and (b) low-resolution image space supervision on NeRF, leading to low-quality 3D models with a long processing time. In this paper, we address these limitations by utilizing a two-stage optimization framework. First, we obtain a coarse model using a low-resolution diffusion prior and accelerate with a sparse 3D hash grid structure. Using the coarse representation as the initialization, we further optimize a textured 3D mesh model with an efficient differentiable renderer interacting with a high-resolution latent diffusion model. Our method, dubbed Magic3D, can create high quality 3D mesh models in 40 minutes, which is 2x faster than DreamFusion (reportedly taking 1.5 hours on average), while also achieving higher resolution. User studies show 61.7% raters to prefer our approach over DreamFusion. Together with the image-conditioned generation capabilities, we provide users with new ways to control 3D synthesis, opening up new avenues to various creative applications.

artificial intelligence, machine learning, optimization problem, (18 more...)

arXiv.org Artificial Intelligence

2211.1044

Genre: Research Report (0.64)

Industry:

Leisure & Entertainment (0.67)
Media (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.88)

Add feedback

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Balaji, Yogesh, Nah, Seungjun, Huang, Xun, Vahdat, Arash, Song, Jiaming, Zhang, Qinsheng, Kreis, Karsten, Aittala, Miika, Aila, Timo, Laine, Samuli, Catanzaro, Bryan, Karras, Tero, Liu, Ming-Yu

arXiv.org Artificial IntelligenceMar-13-2023

Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis. Starting from random noise, such text-to-image diffusion models gradually synthesize images in an iterative fashion while conditioning on text prompts. We find that their synthesis behavior qualitatively changes throughout this process: Early in sampling, generation strongly relies on the text prompt to generate text-aligned content, while later, the text conditioning is almost entirely ignored. This suggests that sharing model parameters throughout the entire generation process may not be ideal. Therefore, in contrast to existing works, we propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages. To maintain training efficiency, we initially train a single model, which is then split into specialized models that are trained for the specific stages of the iterative generation process. Our ensemble of diffusion models, called eDiff-I, results in improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark. In addition, we train our model to exploit a variety of embeddings for conditioning, including the T5 text, CLIP text, and CLIP image embeddings. We show that these different embeddings lead to different behaviors. Notably, the CLIP image embedding allows an intuitive way of transferring the style of a reference image to the target text-to-image output. Lastly, we show a technique that enables eDiff-I's "paint-with-words" capability. A user can select the word in the input text and paint it in a canvas to control the output, which is very handy for crafting the desired image in mind. The project page is available at https://deepimagination.cc/eDiff-I/

diffusion model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2211.01324

Genre: Research Report (0.64)

Industry: Leisure & Entertainment > Sports (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.30)

Add feedback

Few-Shot Unsupervised Image-to-Image Translation

Liu, Ming-Yu, Huang, Xun, Mallya, Arun, Karras, Tero, Aila, Timo, Lehtinen, Jaakko, Kautz, Jan

arXiv.org Machine LearningMay-5-2019

Image-to-image Translation (FUNIT) framework, aiming at learning an image-to-image translation Unsupervised image-to-image translation methods learn model for mapping an image of a source class to an analogous to map images in a given class to an analogous image in image of a target class by leveraging few images of a different class, drawing on unstructured (non-registered) the target class given at test time. The model is never shown datasets of images. While remarkably successful, current images of the target class during training but is asked to methods require access to many images in both source and generate some of them at test time. To proceed, we first hypothesize destination classes at training time. We argue this greatly that the few-shot generation capability of humans limits their use. Drawing inspiration from the human capability develops from their past visual experiences--a person can of picking up the essence of a novel object from better imagine views of a new object if the person has seen a small number of examples and generalizing from there, many more different object classes in the past. Based on we seek a few-shot, unsupervised image-to-image translation the hypothesis, we train our FUNIT model using a dataset algorithm that works on previously unseen target containing images of many different object classes for simulating classes that are specified, at test time, only by a few example the past visual experiences. Specifically, we train the images. Our model achieves this few-shot generation model to translate images from one class to another class capability by coupling an adversarial training scheme by leveraging few example images of the another class.

artificial intelligence, object-oriented architecture, source class, (19 more...)

arXiv.org Machine Learning

1905.01723

Genre: Research Report (0.64)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

Add feedback

Multimodal Unsupervised Image-to-Image Translation

Huang, Xun, Liu, Ming-Yu, Belongie, Serge, Kautz, Jan

arXiv.org Machine LearningApr-12-2018

Unsupervised image-to-image translation is an important and challenging problem in computer vision. Given an image in the source domain, the goal is to learn the conditional distribution of corresponding images in the target domain, without seeing any pairs of corresponding images. While this conditional distribution is inherently multimodal, existing approaches make an overly simplified assumption, modeling it as a deterministic one-to-one mapping. As a result, they fail to generate diverse outputs from a given source domain image. To address this limitation, we propose a Multimodal Unsupervised Image-to-image Translation (MUNIT) framework. We assume that the image representation can be decomposed into a content code that is domain-invariant, and a style code that captures domain-specific properties. To translate an image to another domain, we recombine its content code with a random style code sampled from the style space of the target domain. We analyze the proposed framework and establish several theoretical results. Extensive experiments with comparisons to the state-of-the-art approaches further demonstrates the advantage of the proposed framework. Moreover, our framework allows users to control the style of translation outputs by providing an example style image. Code and pretrained models are available at https://github.com/nvlabs/MUNIT.

artificial intelligence, neural network, translation, (15 more...)

arXiv.org Machine Learning

1804.04732

Genre: Research Report (0.84)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Stacked Generative Adversarial Networks

Huang, Xun, Li, Yixuan, Poursaeed, Omid, Hopcroft, John, Belongie, Serge

arXiv.org Machine LearningApr-12-2017

In this paper, we propose a novel generative model named Stacked Generative Adversarial Networks (SGAN), which is trained to invert the hierarchical representations of a bottom-up discriminative network. Our model consists of a top-down stack of GANs, each learned to generate lower-level representations conditioned on higher-level representations. A representation discriminator is introduced at each feature hierarchy to encourage the representation manifold of the generator to align with that of the bottom-up discriminative network, leveraging the powerful discriminative representations to guide the generative model. In addition, we introduce a conditional loss that encourages the use of conditional information from the layer above, and a novel entropy loss that maximizes a variational lower bound on the conditional entropy of generator outputs. We first train each stack independently, and then train the whole model end-to-end. Unlike the original GAN that uses a single noise vector to represent all the variations, our SGAN decomposes variations into multiple levels and gradually resolves uncertainties in the top-down generative process. Based on visual inspection, Inception scores and visual Turing test, we demonstrate that SGAN is able to generate images of much higher quality than GANs without stacking.

deep learning, neural network, representation, (18 more...)

arXiv.org Machine Learning

1612.04357

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback