Liu, Guilin
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
NVIDIA, Bjorck, Johan, Castañeda, Fernando, Cherniadev, Nikita, Da, Xingye, Ding, Runyu, Fan, Linxi "Jim", Fang, Yu, Fox, Dieter, Hu, Fengyuan, Huang, Spencer, Jang, Joel, Jiang, Zhenyu, Kautz, Jan, Kundalia, Kaushil, Lao, Lawrence, Li, Zhiqi, Lin, Zongyu, Lin, Kevin, Liu, Guilin, Llontop, Edith, Magne, Loic, Mandlekar, Ajay, Narayan, Avnish, Nasiriany, Soroush, Reed, Scott, Tan, You Liang, Wang, Guanzhi, Wang, Zu, Wang, Jing, Wang, Qi, Xiang, Jiannan, Xie, Yuqi, Xu, Yinzhen, Xu, Zhenjia, Ye, Seonghyeon, Yu, Zhiding, Zhang, Ao, Zhang, Hao, Zhao, Yizhou, Zheng, Ruijie, Zhu, Yuke
General-purpose robots need a versatile body and an intelligent mind. Recent advancements in humanoid robots have shown great promise as a hardware platform for building generalist autonomy in the human world. A robot foundation model, trained on massive and diverse data sources, is essential for enabling the robots to reason about novel situations, robustly handle real-world variability, and rapidly learn new tasks. To this end, we introduce GR00T N1, an open foundation model for humanoid robots. GR00T N1 is a Vision-Language-Action (VLA) model with a dual-system architecture. The vision-language module (System 2) interprets the environment through vision and language instructions. The subsequent diffusion transformer module (System 1) generates fluid motor actions in real time. Both modules are tightly coupled and jointly trained end-to-end. We train GR00T N1 with a heterogeneous mixture of real-robot trajectories, human videos, and synthetically generated datasets. We show that our generalist robot model GR00T N1 outperforms the state-of-the-art imitation learning baselines on standard simulation benchmarks across multiple robot embodiments. Furthermore, we deploy our model on the Fourier GR-1 humanoid robot for language-conditioned bimanual manipulation tasks, achieving strong performance with high data efficiency.
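To make the dual-system design concrete, the minimal sketch below shows how a System 2 vision-language backbone could feed a System 1 diffusion-style action head that iteratively denoises an action chunk. The module names, dimensions, stand-in encoders, and the crude Euler-style update are illustrative assumptions, not the released GR00T N1 architecture.

```python
# Hedged sketch of a dual-system VLA forward pass, loosely following the
# description above. All components here are simplified stand-ins.
import torch
import torch.nn as nn

class System2Stub(nn.Module):
    """Stand-in for the vision-language backbone (System 2)."""
    def __init__(self, dim=256):
        super().__init__()
        self.vision = nn.Linear(3 * 224 * 224, dim)   # placeholder image encoder
        self.language = nn.Embedding(1000, dim)       # placeholder token embedding

    def forward(self, image, tokens):
        img_feat = self.vision(image.flatten(1)).unsqueeze(1)   # (B, 1, D)
        txt_feat = self.language(tokens)                        # (B, T, D)
        return torch.cat([img_feat, txt_feat], dim=1)           # context tokens

class System1DiffusionHead(nn.Module):
    """Stand-in for the diffusion-transformer action head (System 1)."""
    def __init__(self, dim=256, act_dim=24, horizon=16):
        super().__init__()
        self.horizon, self.act_dim = horizon, act_dim
        self.embed = nn.Linear(act_dim + 1, dim)                # noisy action + timestep
        layer = nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(dim, act_dim)

    def forward(self, noisy_actions, t, context):
        t_feat = t.view(-1, 1, 1).expand(-1, self.horizon, 1)
        h = self.embed(torch.cat([noisy_actions, t_feat], dim=-1))
        h = self.decoder(h, context)        # cross-attend to vision-language features
        return self.out(h)                  # predicted noise / velocity

# Toy denoising loop over an action chunk (illustrative only).
vlm, head = System2Stub(), System1DiffusionHead()
image = torch.randn(2, 3, 224, 224)
tokens = torch.randint(0, 1000, (2, 12))
context = vlm(image, tokens)
actions = torch.randn(2, 16, 24)            # start from Gaussian noise
for step in range(4):                       # a few crude Euler-style updates
    t = torch.full((2,), 1.0 - step / 4)
    actions = actions - 0.25 * head(actions, t, context)
```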
AIDE: Agentically Improve Visual Language Model with Domain Experts
Chiu, Ming-Chang, Liu, Fuxiao, Sapra, Karan, Tao, Andrew, Jacoob, Yaser, Ma, Xuezhe, Yu, Zhiding, Liu, Guilin
The enhancement of Visual Language Models (VLMs) has traditionally relied on knowledge distillation from larger, more capable models. This dependence creates a fundamental bottleneck for improving state-of-the-art systems, particularly when no superior models exist. We introduce AIDE (Agentic Improvement through Domain Experts), a novel framework that enables VLMs to autonomously enhance their capabilities by leveraging specialized domain expert models. AIDE operates through a four-stage process: (1) identifying instances for refinement, (2) engaging domain experts for targeted analysis, (3) synthesizing expert outputs with existing data, and (4) integrating enhanced instances into the training pipeline. Experiments on multiple benchmarks, including MMMU, MME, and MMBench, demonstrate AIDE's ability to achieve notable performance gains without relying on larger VLMs or human supervision. Our framework provides a scalable, resource-efficient approach to continuous VLM improvement, addressing critical limitations in current methodologies and proving particularly valuable when no larger models are available.
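The four-stage process lends itself to a simple control loop. The sketch below is a hedged illustration of one AIDE-style refinement round; the confidence-based filtering heuristic, expert interfaces, and data schema are assumptions for exposition rather than the paper's actual implementation.

```python
# Hedged sketch of one round of the four-stage loop described above.
from dataclasses import dataclass, replace

@dataclass
class Instance:
    image_path: str
    question: str
    answer: str
    confidence: float            # e.g., a log-prob-based self-assessment score

def aide_round(dataset, experts, synthesize, threshold=0.5):
    """One AIDE-style refinement round over a list of Instance records."""
    refined = []
    # Stage 1: identify instances whose current answers look unreliable.
    weak = [ex for ex in dataset if ex.confidence < threshold]
    for ex in weak:
        # Stage 2: engage domain experts (e.g., OCR, detection, grounding models)
        # for targeted analysis of the flagged instance.
        expert_outputs = {name: expert(ex.image_path, ex.question)
                          for name, expert in experts.items()}
        # Stage 3: synthesize expert outputs with the existing annotation.
        new_answer = synthesize(ex, expert_outputs)
        refined.append(replace(ex, answer=new_answer, confidence=1.0))
    # Stage 4: integrate the enhanced instances back into the training pool.
    return dataset + refined

# Toy usage with stand-in experts and a trivial synthesizer.
data = [Instance("img0.jpg", "What does the sign say?", "unknown", confidence=0.2)]
experts = {"ocr": lambda path, q: "STOP"}
synthesize = lambda ex, outs: outs["ocr"]
print(aide_round(data, experts, synthesize))
```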
Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models
Li, Zhiqi, Chen, Guo, Liu, Shilong, Wang, Shihao, VS, Vibashan, Ji, Yishen, Lan, Shiyi, Zhang, Hao, Zhao, Yilin, Radhakrishnan, Subhashree, Chang, Nadine, Sapra, Karan, Deshmukh, Amala Sanjay, Rintamaki, Tuomas, Le, Matthieu, Karmanov, Ilia, Voegtle, Lukas, Fischer, Philipp, Huang, De-An, Roman, Timo, Lu, Tong, Alvarez, Jose M., Catanzaro, Bryan, Kautz, Jan, Tao, Andrew, Liu, Guilin, Yu, Zhiding
Recently, promising progress has been made by open-source vision-language models (VLMs) in bringing their capabilities closer to those of proprietary frontier models. However, most open-source models only publish their final model weights, leaving the critical details of data strategies and implementation largely opaque. In this work, we address VLM post-training from a data-centric perspective, showing the key role of data strategy in developing frontier VLMs. By studying and building our post-training data strategy from scratch, we share detailed insights into the development processes, aiming to benefit the development of competitive models for the open-source community. Our introduced data strategy, together with training recipes and model design, leads to a family of performant VLMs named Eagle2. Specifically, Eagle2-9B achieves state-of-the-art results across various multimodal benchmarks, matching certain competitive models with up to 70B parameters.
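Since the abstract discusses post-training data strategy only at a high level, the following sketch merely illustrates the general shape of a weighted data-mixture sampler for supervised fine-tuning. The dataset names, mixture weights, and sampling scheme are assumptions for exposition and do not come from the paper.

```python
# Illustrative sketch only: a weighted sampler over several post-training data
# sources. Nothing here is specified by the Eagle 2 paper.
import random

def sample_post_training_batch(datasets, weights, batch_size, rng=random.Random(0)):
    """Draw a batch from several fine-tuning sources in proportion to
    hand-tuned mixture weights."""
    names = list(datasets)
    total = sum(weights[n] for n in names)
    probs = [weights[n] / total for n in names]
    batch = []
    for _ in range(batch_size):
        source = rng.choices(names, weights=probs, k=1)[0]
        batch.append(rng.choice(datasets[source]))
    return batch

# Hypothetical usage with placeholder sources.
datasets = {"captioning": ["c1", "c2"], "ocr": ["o1"], "chart_qa": ["q1", "q2", "q3"]}
weights = {"captioning": 0.5, "ocr": 0.2, "chart_qa": 0.3}
print(sample_post_training_batch(datasets, weights, batch_size=4))
```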
DiffiT: Diffusion Vision Transformers for Image Generation
Hatamizadeh, Ali, Song, Jiaming, Liu, Guilin, Kautz, Jan, Vahdat, Arash
Diffusion models with their powerful expressivity and high sample quality have enabled many new applications and use-cases in various domains. For sample generation, these models rely on a denoising neural network that generates images by iterative denoising. Yet, the role of the denoising network architecture is not well studied, with most efforts relying on convolutional residual U-Nets. In this paper, we study the effectiveness of vision transformers in diffusion-based generative learning. Specifically, we propose a new model, denoted as Diffusion Vision Transformers (DiffiT), which consists of a hybrid hierarchical architecture with a U-shaped encoder and decoder. We introduce a novel time-dependent self-attention module that allows attention layers to adapt their behavior at different stages of the denoising process in an efficient manner. We also introduce latent DiffiT, which consists of a transformer model with the proposed self-attention layers, for high-resolution image generation. Our results show that DiffiT is surprisingly effective in generating high-fidelity images, and it achieves state-of-the-art (SOTA) results on a variety of class-conditional and unconditional synthesis tasks. In the latent space, DiffiT achieves a new SOTA FID score of 1.73 on the ImageNet-256 dataset. Repository: https://github.com/NVlabs/DiffiT
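One way to picture a time-dependent self-attention layer is to let the timestep embedding contribute additive terms to the query/key/value projections, as in the hedged sketch below. The precise DiffiT formulation is in the linked repository; this parameterization and the module sizes are assumptions for illustration.

```python
# Hedged sketch of time-dependent self-attention in the spirit of the module
# described above; not the exact DiffiT layer.
import torch
import torch.nn as nn

class TimeDependentSelfAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_qkv_spatial = nn.Linear(dim, 3 * dim)
        self.to_qkv_time = nn.Linear(dim, 3 * dim)

    def forward(self, x, t_emb):
        # x: (B, N, D) spatial tokens; t_emb: (B, D) timestep embedding.
        qs, ks, vs = self.to_qkv_spatial(x).chunk(3, dim=-1)
        qt, kt, vt = self.to_qkv_time(t_emb).unsqueeze(1).chunk(3, dim=-1)
        q, k, v = qs + qt, ks + kt, vs + vt   # attention adapts with denoising time
        out, _ = self.attn(q, k, v, need_weights=False)
        return out

# Toy usage on random tokens and a random time embedding.
layer = TimeDependentSelfAttention()
x = torch.randn(2, 64, 256)
t_emb = torch.randn(2, 256)
print(layer(x, t_emb).shape)   # torch.Size([2, 64, 256])
```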
Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models
Ge, Songwei, Nah, Seungjun, Liu, Guilin, Poon, Tyler, Tao, Andrew, Catanzaro, Bryan, Jacobs, David, Huang, Jia-Bin, Liu, Ming-Yu, Balaji, Yogesh
Despite tremendous progress in generating high-quality images using diffusion models, synthesizing a sequence of animated frames that are both photorealistic and temporally coherent is still in its infancy. While off-the-shelf billion-scale datasets for image generation are available, collecting similar video data of the same scale is still challenging. Also, training a video diffusion model is computationally much more expensive than its image counterpart. In this work, we explore finetuning a pretrained image diffusion model with video data as a practical solution for the video synthesis task. We find that naively extending the image noise prior to a video noise prior in video diffusion leads to sub-optimal performance. Our carefully designed video noise prior leads to substantially better performance. Extensive experimental validation shows that our model, Preserve Your Own Correlation (PYoCo), attains SOTA zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks. It also achieves SOTA video generation quality on the small-scale UCF-101 benchmark with a $10\times$ smaller model using significantly less computation than the prior art.
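A video noise prior that preserves correlation across frames can be sketched as a shared noise component mixed with per-frame noise, as below. The mixing parameter and this exact parameterization are assumptions; the paper's actual noise priors may differ in detail.

```python
# Hedged sketch of a correlated ("mixed") video noise prior: every frame shares
# a common noise component, so the noise is correlated across time.
import torch

def mixed_video_noise(batch, frames, channels, height, width, alpha=1.0):
    """Per-frame noise = shared component + independent component,
    scaled so each frame's noise keeps unit variance."""
    shared = torch.randn(batch, 1, channels, height, width)        # common across frames
    independent = torch.randn(batch, frames, channels, height, width)
    scale = (1.0 + alpha ** 2) ** 0.5
    return (alpha * shared + independent) / scale

noise = mixed_video_noise(batch=2, frames=8, channels=4, height=32, width=32)
print(noise.shape, noise.var().item())   # variance stays close to 1
```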
View Generalization for Single Image Textured 3D Models
Bhattad, Anand, Dundar, Aysegul, Liu, Guilin, Tao, Andrew, Catanzaro, Bryan
Humans can easily infer the underlying 3D geometry and texture of an object from only a single 2D image. Current computer vision methods can do this, too, but suffer from view generalization problems: the models inferred tend to make poor predictions of appearance in novel views. As with generalization problems in machine learning, the difficulty is balancing single-view accuracy (cf. training error; bias) with novel-view accuracy (cf. test error; variance). We describe a class of models whose geometric rigidity is easily controlled to manage this tradeoff. We describe a cycle consistency loss that improves view generalization (roughly, a model from a generated view should predict the original view well). View generalization of textures requires that models share texture information, so a car seen from the back still has headlights because other cars have headlights. We describe a second cycle consistency loss that encourages model textures to be aligned, so as to encourage sharing. We compare our method against the state-of-the-art method and show both qualitative and quantitative improvements.
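The geometric cycle consistency idea can be sketched abstractly: infer a model, render a novel view, re-infer from that view, and require the re-inferred model to reproduce the original view. The predictor, renderer, and camera sampler below are stand-ins (assumptions), not the paper's actual pipeline.

```python
# Hedged sketch of the view-generalization cycle consistency loss.
import torch

def cycle_consistency_loss(image, camera, predict, render, sample_camera):
    """predict: image -> 3D model params; render: (params, camera) -> image."""
    params = predict(image)                   # infer a textured 3D model
    novel_cam = sample_camera()               # pick a new viewpoint
    novel_view = render(params, novel_cam)    # generate the novel view
    params_cycle = predict(novel_view)        # re-infer from the generated view
    recon = render(params_cycle, camera)      # render back to the input view
    return torch.mean((recon - image) ** 2)

# Toy usage with trivial stand-ins so the sketch runs end to end.
predict = lambda img: img.mean(dim=(2, 3))    # fake "model parameters"
render = lambda p, cam: p[:, :, None, None].expand(-1, -1, 64, 64) + cam
sample_camera = lambda: torch.zeros(1)
image, camera = torch.rand(2, 3, 64, 64), torch.zeros(1)
print(cycle_consistency_loss(image, camera, predict, render, sample_camera))
```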
Video-to-Video Synthesis
Wang, Ting-Chun, Liu, Ming-Yu, Zhu, Jun-Yan, Liu, Guilin, Tao, Andrew, Kautz, Jan, Catanzaro, Bryan
We study the problem of video-to-video synthesis, whose goal is to learn a mapping function from an input source video (e.g., a sequence of semantic segmentation masks) to an output photorealistic video that precisely depicts the content of the source video. While its image counterpart, the image-to-image translation problem, is a popular topic, the video-to-video synthesis problem is less explored in the literature. Without modeling temporal dynamics, directly applying existing image synthesis approaches to an input video often results in temporally incoherent videos of low visual quality. In this paper, we propose a video-to-video synthesis approach under the generative adversarial learning framework. Through carefully-designed generators and discriminators, coupled with a spatio-temporal adversarial objective, we achieve high-resolution, photorealistic, temporally coherent video results on a diverse set of input formats including segmentation masks, sketches, and poses.
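The spatio-temporal adversarial objective can be sketched as a generator loss that combines an image discriminator over individual frames with a video discriminator over clips. The toy discriminators, the omitted flow-warping terms, and the loss weighting below are assumptions, not the released vid2vid objective.

```python
# Hedged sketch of a spatio-temporal adversarial generator loss: per-frame
# realism from an image discriminator plus temporal coherence from a video
# discriminator over the whole clip.
import torch
import torch.nn.functional as F

def generator_adv_loss(fake_frames, d_image, d_video):
    # fake_frames: (B, T, C, H, W) generated clip.
    b, t, c, h, w = fake_frames.shape
    frame_logits = d_image(fake_frames.reshape(b * t, c, h, w))   # per-frame realism
    clip_logits = d_video(fake_frames)                            # temporal coherence
    bce_real = lambda logits: F.binary_cross_entropy_with_logits(
        logits, torch.ones_like(logits))
    return bce_real(frame_logits) + bce_real(clip_logits)

# Toy discriminators so the sketch runs: global average pooling to one logit.
d_image = lambda x: x.mean(dim=(1, 2, 3)).unsqueeze(1)
d_video = lambda x: x.mean(dim=(1, 2, 3, 4)).unsqueeze(1)
fake = torch.rand(2, 8, 3, 64, 64)
print(generator_adv_loss(fake, d_image, d_video))
```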