Zeng, Ailing
SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation
Yin, Wanqi, Cai, Zhongang, Wang, Ruisi, Zeng, Ailing, Wei, Chen, Sun, Qingping, Mei, Haiyi, Wang, Yanjun, Pang, Hui En, Zhang, Mingyuan, Zhang, Lei, Loy, Chen Change, Yamashita, Atsushi, Yang, Lei, Liu, Ziwei
Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Despite encouraging progress, current state-of-the-art methods focus on training innovative architectural designs on confined datasets. In this work, we investigate the impact of scaling up EHPS towards a family of generalist foundation models. More importantly, capitalizing on insights obtained from the extensive benchmarking process, we optimize our training scheme and select datasets that lead to a significant leap in EHPS capabilities. Ultimately, we achieve diminishing returns at 10M training instances from diverse data sources. To exclude the influence of algorithmic design, we base our experiments on two minimalist architectures: SMPLer-X, which consists of an intermediate step for hand and face localization, and SMPLest-X, an even simpler version that reduces the network to its bare essentials and highlights significant advances in the capture of articulated hands. Moreover, our finetuning strategy turns the generalist into specialist models, allowing them to achieve further performance boosts. Notably, our foundation models consistently deliver state-of-the-art results on seven benchmarks such as AGORA, UBody, EgoBody, and our proposed SynHand dataset for comprehensive hand evaluation. This task typically uses parametric human models (e.g., SMPL-X [1]) as a powerful representation of the human body, face, and hands, and a flurry of diverse datasets entering the scene in recent years [2], [3], [4], [5], [6], [7], [8], [9], [10], [11] has provided the community new opportunities to study various aspects such as capture environment, pose distribution, body visibility, and camera views. Yet, state-of-the-art methods channel their attention towards advancements in architectural designs and remain tethered to a limited selection of these datasets. Accordingly, we establish a systematic benchmark that compares results across various scenarios and measures performance across a basket of key benchmarks, in order to provide a holistic measurement of generalization capabilities. Our study underscores the importance of harnessing a multitude of datasets to capitalize on their complementary nature. Moreover, we contribute a new dataset, SynHand, to provide the community with a long-awaited benchmark for comprehensive hand pose evaluation in a whole-body setting. SynHand features diverse hand poses in close-up shots, accurately annotated as part of the whole-body SMPL-X labels.
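Because the study's gains hinge on how datasets are selected and combined, a minimal sketch of the weighted multi-dataset sampling such a training scheme implies is given below; the placeholder datasets, sizes, and weights are illustrative assumptions, not the paper's selected mixture.

    # Minimal sketch: weighted sampling over a concatenation of datasets,
    # equalizing each source's size and then applying a per-source weight.
    import torch
    from torch.utils.data import (ConcatDataset, DataLoader, TensorDataset,
                                  WeightedRandomSampler)

    sets = {  # placeholders standing in for EHPS data sources
        "A": TensorDataset(torch.randn(1000, 3)),
        "B": TensorDataset(torch.randn(4000, 3)),
    }
    source_weight = {"A": 2.0, "B": 1.0}  # oversample the scarcer source

    concat = ConcatDataset(list(sets.values()))
    sample_w = torch.cat([
        torch.full((len(ds),), source_weight[name] / len(ds))
        for name, ds in sets.items()
    ])  # one weight per instance, uniform within each source
    loader = DataLoader(
        concat, batch_size=64,
        sampler=WeightedRandomSampler(sample_w, num_samples=len(concat)))
    (batch,) = next(iter(loader))
    print(batch.shape)  # torch.Size([64, 3])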
IDOL: Instant Photorealistic 3D Human Creation from a Single Image
Zhuang, Yiyu, Lv, Jiaxi, Wen, Hao, Shuai, Qing, Zeng, Ailing, Zhu, Hao, Chen, Shifeng, Yang, Yujiu, Cao, Xun, Liu, Wei
Creating a high-fidelity, animatable 3D full-body avatar from a single image is a challenging task due to the diverse appearance and poses of humans and the limited availability of high-quality training data. To achieve fast and high-quality human reconstruction, this work rethinks the task from the perspectives of dataset, model, and representation. First, we introduce a large-scale HUman-centric GEnerated dataset, HuGe100K, consisting of 100K diverse, photorealistic sets of human images. Each set contains 24-view frames in specific human poses, generated using a pose-controllable image-to-multi-view model. Next, leveraging the diversity in views, poses, and appearances within HuGe100K, we develop a scalable feed-forward transformer model to predict a 3D human Gaussian representation in a uniform space from a given human image. This model is trained to disentangle human pose, body shape, clothing geometry, and texture. The estimated Gaussians can be animated without post-processing. We conduct comprehensive experiments to validate the effectiveness of the proposed dataset and method. Our model demonstrates the ability to instantly reconstruct photorealistic humans at 1K resolution from a single input image on a single GPU. Additionally, it seamlessly supports various applications, as well as shape and texture editing tasks.
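A minimal sketch of a feed-forward head that maps image tokens to per-Gaussian parameters in a canonical (uniform) space, in the spirit of the transformer described above; the query count, the 14-channel split (position, scale, rotation, opacity, color), and the layer sizes are illustrative assumptions, not the released IDOL architecture.

    import torch
    import torch.nn as nn

    class GaussianHead(nn.Module):
        def __init__(self, d_model=256, n_gauss=1024):
            super().__init__()
            # Learned queries, one per predicted Gaussian in canonical space.
            self.queries = nn.Parameter(torch.randn(n_gauss, d_model))
            layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers=2)
            # 3 pos + 3 scale + 4 quaternion + 1 opacity + 3 color = 14
            self.head = nn.Linear(d_model, 14)

        def forward(self, img_tokens):  # img_tokens: (batch, n_tokens, d_model)
            b = img_tokens.size(0)
            q = self.queries.unsqueeze(0).expand(b, -1, -1)
            g = self.head(self.decoder(q, img_tokens))
            pos, scale, rot, opa, rgb = g.split([3, 3, 4, 1, 3], dim=-1)
            rot = torch.nn.functional.normalize(rot, dim=-1)  # unit quaternion
            return pos, scale.exp(), rot, opa.sigmoid(), rgb.sigmoid()

    tokens = torch.randn(2, 196, 256)  # e.g., ViT patch features of the image
    pos, scale, rot, opa, rgb = GaussianHead()(tokens)
    print(pos.shape)  # torch.Size([2, 1024, 3])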
DreamWaltz-G: Expressive 3D Gaussian Avatars from Skeleton-Guided 2D Diffusion
Huang, Yukun, Wang, Jianan, Zeng, Ailing, Zha, Zheng-Jun, Zhang, Lei, Liu, Xihui
Leveraging pretrained 2D diffusion models and score distillation sampling (SDS), recent methods have shown promising results for text-to-3D avatar generation. However, generating high-quality 3D avatars capable of expressive animation remains challenging. In this work, we present DreamWaltz-G, a novel learning framework for animatable 3D avatar generation from text. The core of this framework lies in Skeleton-guided Score Distillation and Hybrid 3D Gaussian Avatar representation. Specifically, the proposed skeleton-guided score distillation integrates skeleton controls from 3D human templates into 2D diffusion models, enhancing the consistency of SDS supervision in terms of view and human pose. This facilitates the generation of high-quality avatars, mitigating issues such as multiple faces, extra limbs, and blurring. The proposed hybrid 3D Gaussian avatar representation builds on the efficient 3D Gaussians, combining neural implicit fields and parameterized 3D meshes to enable real-time rendering, stable SDS optimization, and expressive animation. Extensive experiments demonstrate that DreamWaltz-G is highly effective in generating and animating 3D avatars, outperforming existing methods in both visual quality and animation expressiveness. Our framework further supports diverse applications, including human video reenactment and multi-subject scene composition.
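A minimal sketch of one skeleton-guided score distillation step: noise the rendered avatar, query a frozen denoiser that also sees a rendered skeleton map (e.g., through a ControlNet-style branch), and use the noise residual as a gradient for the render. The dummy denoiser and the w(t) = 1 - alpha_bar weighting are placeholder assumptions, not DreamWaltz-G's exact setup.

    import torch

    def sds_grad(render, denoiser, text_emb, skel_map, alphas_bar):
        """render: (B, C, H, W) differentiable render of the 3D avatar."""
        b = render.size(0)
        t = torch.randint(1, len(alphas_bar), (b,))       # random timesteps
        a = alphas_bar[t].view(b, 1, 1, 1)
        noise = torch.randn_like(render)
        x_t = a.sqrt() * render + (1 - a).sqrt() * noise  # forward diffusion
        with torch.no_grad():                             # denoiser stays frozen
            eps_hat = denoiser(x_t, t, text_emb, skel_map)
        return (1 - a) * (eps_hat - noise)                # per-pixel SDS gradient

    dummy = lambda x_t, t, emb, skel: torch.randn_like(x_t)  # stand-in UNet
    render = torch.randn(2, 3, 64, 64, requires_grad=True)
    alphas_bar = torch.linspace(0.999, 0.01, 1000)
    # In practice this backpropagates through the renderer into avatar params.
    render.backward(gradient=sds_grad(render, dummy, None, None, alphas_bar))
    print(render.grad.shape)  # torch.Size([2, 3, 64, 64])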
DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation
Chen, Junming, Liu, Yunfei, Wang, Jianan, Zeng, Ailing, Li, Yu, Chen, Qifeng
We propose DiffSHEG, a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation of arbitrary length. While previous works focused on co-speech gesture or expression generation individually, the joint generation of synchronized expressions and gestures remains barely explored. To address this, our diffusion-based co-speech motion generation transformer enables uni-directional information flow from expression to gesture, facilitating improved matching of joint expression-gesture distributions. Furthermore, we introduce an outpainting-based sampling strategy for arbitrarily long sequence generation in diffusion models, offering flexibility and computational efficiency. Our method provides a practical solution that produces high-quality synchronized expression and gesture generation driven by speech. Evaluated on two public datasets, our approach achieves state-of-the-art performance both quantitatively and qualitatively. Additionally, a user study confirms the superiority of DiffSHEG over prior approaches. By enabling the real-time generation of expressive and synchronized motions, DiffSHEG showcases its potential for various applications in the development of digital humans and embodied agents.
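A minimal sketch of outpainting-style sampling for arbitrary length: each new window is denoised while its leading frames are repeatedly clamped to a noised copy of the previous window's tail, so consecutive windows agree on the overlap. The denoise_step callable, window sizes, and noise schedule are stand-ins, not DiffSHEG's exact sampler.

    import torch

    def sample_window(denoise_step, alphas_bar, known, win, dim):
        """Denoise one window; `known` (or None) fixes the first frames."""
        x = torch.randn(1, win, dim)
        for t in range(len(alphas_bar) - 1, -1, -1):
            if known is not None:
                a = alphas_bar[t]  # clamp overlap to noised known frames
                x[:, :known.size(1)] = (a.sqrt() * known
                                        + (1 - a).sqrt() * torch.randn_like(known))
            x = denoise_step(x, t)
        if known is not None:
            x[:, :known.size(1)] = known  # keep the overlap exactly
        return x

    def sample_long(denoise_step, alphas_bar, total_len, win=64, overlap=16, dim=32):
        seq = sample_window(denoise_step, alphas_bar, None, win, dim)
        while seq.size(1) < total_len:
            tail = seq[:, -overlap:]  # condition the next window on the tail
            nxt = sample_window(denoise_step, alphas_bar, tail, win, dim)
            seq = torch.cat([seq, nxt[:, overlap:]], dim=1)
        return seq[:, :total_len]

    step = lambda x, t: 0.99 * x  # dummy reverse-diffusion step
    motion = sample_long(step, torch.linspace(0.999, 0.01, 50), total_len=200)
    print(motion.shape)  # torch.Size([1, 200, 32])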
FITS: Modeling Time Series with $10k$ Parameters
Xu, Zhijian, Zeng, Ailing, Xu, Qiang
In this paper, we introduce FITS, a lightweight yet powerful model for time series analysis. Unlike existing models that directly process raw time-domain data, FITS operates on the principle that time series can be manipulated through interpolation in the complex frequency domain, achieving performance comparable to state-of-the-art models for time series forecasting and anomaly detection tasks. Notably, FITS accomplishes this with a svelte profile of just about 10k parameters, making it ideally suited for edge devices and paving the way for a wide range of applications. The code is available at https://github.com/VEWOXIC/FITS. Time series analysis plays a pivotal role in a myriad of sectors, from healthcare appliances to smart factories. Within these domains, the reliance is often on edge devices like smart sensors, driven by MCUs with limited computational and memory resources. Time series data, marked by its inherent complexity and dynamism, typically presents information that is both sparse and scattered within the time domain. To effectively harness this data, recent research has given rise to sophisticated models and methodologies (Zhou et al., 2021; Liu et al., 2022a; Zeng et al., 2023; Nie et al., 2023; Zhang et al., 2022). Yet, the computational and memory costs of these models make them unsuitable for resource-constrained edge devices. On the other hand, the frequency domain representation of time series data promises a more compact and efficient portrayal of inherent patterns. While existing research has indeed tapped into the frequency domain for time series analysis -- FEDformer (Zhou et al., 2022a) enriches its features using spectral data, and TimesNet (Wu et al., 2023) harnesses high-amplitude frequencies for feature extraction via CNNs -- a comprehensive utilization of the frequency domain's compactness remains largely unexplored. Specifically, the ability of the frequency domain to employ complex numbers in capturing both amplitude and phase information is not utilized, resulting in the continued reliance on compute-intensive models for temporal feature extraction. In this study, we reinterpret time series analysis tasks, such as forecasting and reconstruction, as interpolation exercises within the complex frequency domain.
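The core computation is compact enough to sketch in full: transform to the frequency domain, low-pass, interpolate with a complex-valued linear layer, and transform back. This is a minimal sketch assuming a single-channel series and a fixed cutoff; the released code's cutoff rule and normalization may differ.

    import torch
    import torch.nn as nn

    class FreqInterp(nn.Module):
        def __init__(self, seq_len=96, pred_len=96, cutoff=24):
            super().__init__()
            self.seq_len, self.pred_len = seq_len, pred_len
            self.cutoff = cutoff
            out_len = seq_len + pred_len
            # Frequency bins for the longer output, same low-pass ratio.
            out_bins = int(cutoff * out_len / seq_len)
            # Complex-valued linear layer interpolating the truncated spectrum.
            self.interp = nn.Linear(cutoff, out_bins, dtype=torch.cfloat)

        def forward(self, x):                    # x: (batch, seq_len)
            spec = torch.fft.rfft(x, dim=-1)     # complex spectrum
            spec = self.interp(spec[..., :self.cutoff])  # low-pass + interpolate
            out_len = self.seq_len + self.pred_len
            # Back to the time domain, rescaled for the length change.
            y = torch.fft.irfft(spec, n=out_len, dim=-1) * (out_len / self.seq_len)
            return y[..., -self.pred_len:]       # forecast horizon

    x = torch.randn(8, 96)
    print(FreqInterp()(x).shape)                 # torch.Size([8, 96])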
PhysHOI: Physics-Based Imitation of Dynamic Human-Object Interaction
Wang, Yinhuai, Lin, Jing, Zeng, Ailing, Luo, Zhengyi, Zhang, Jian, Zhang, Lei
Humans interact with objects all the time. Enabling a humanoid to learn human-object interaction (HOI) is a key step for future smart animation and intelligent robotics systems. However, recent progress in physics-based HOI requires carefully designed task-specific rewards, making the system unscalable and labor-intensive. This work focuses on dynamic HOI imitation: teaching humanoid dynamic interaction skills through imitating kinematic HOI demonstrations. It is quite challenging because of the complexity of the interaction between body parts and objects and the lack of dynamic HOI data. To handle the above issues, we present PhysHOI, the first physics-based whole-body HOI imitation approach without task-specific reward designs. In addition to the kinematic HOI representations of humans and objects, we introduce the contact graph to explicitly model the contact relations between body parts and objects. A contact graph reward is also designed, which proves to be critical for precise HOI imitation. Based on these key designs, PhysHOI can imitate diverse HOI tasks simply yet effectively without prior knowledge. To make up for the lack of dynamic HOI scenarios in this area, we introduce the BallPlay dataset that contains eight whole-body basketball skills. We validate PhysHOI on diverse HOI tasks, including whole-body grasping and basketball skills.
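A minimal sketch of a contact-graph reward of the kind described: compare the simulated contact graph against the reference one and map the mismatch to a bounded reward. The node layout and the scale k are illustrative assumptions.

    import numpy as np

    def contact_graph_reward(cg_sim, cg_ref, k=5.0):
        """cg_*: (num_body_parts, num_objects) binary contact matrices."""
        mismatch = np.abs(cg_sim - cg_ref).mean()  # fraction of wrong edges
        return float(np.exp(-k * mismatch))        # 1.0 when graphs match

    cg_ref = np.array([[1, 0], [0, 0], [1, 0]])    # e.g., both hands on the ball
    cg_sim = np.array([[1, 0], [0, 0], [0, 0]])    # one contact missed
    print(contact_graph_reward(cg_sim, cg_ref))    # < 1.0, penalizing the miss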
DreamWaltz: Make a Scene with Complex 3D Animatable Avatars
Huang, Yukun, Wang, Jianan, Zeng, Ailing, Cao, He, Qi, Xianbiao, Shi, Yukai, Zha, Zheng-Jun, Zhang, Lei
We present DreamWaltz, a novel framework for generating and animating complex 3D avatars given text guidance and a parametric human body prior. While recent methods have shown encouraging results for text-to-3D generation of common objects, creating high-quality and animatable 3D avatars remains challenging. To create high-quality 3D avatars, DreamWaltz proposes 3D-consistent occlusion-aware Score Distillation Sampling (SDS) to optimize implicit neural representations with canonical poses. It provides view-aligned supervision via 3D-aware skeleton conditioning, which enables complex avatar generation free of artifacts such as multiple faces. For animation, our method learns an animatable 3D avatar representation from abundant image priors of a diffusion model conditioned on various poses, which can animate complex non-rigged avatars given arbitrary poses without retraining. Extensive evaluations demonstrate that DreamWaltz is an effective and robust approach for creating 3D avatars that can take on complex shapes and appearances as well as novel poses for animation. The proposed framework further enables the creation of complex scenes with diverse compositions, including avatar-avatar, avatar-object and avatar-scene interactions. See https://dreamwaltz3d.github.io/ for more vivid 3D avatar and animation results.
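A minimal sketch of occlusion-aware skeleton conditioning: project the 3D joints into the camera and mark joints hidden behind nearer body parts as invisible, so that only visible bones are drawn into the 2D conditioning image. The joint-vs-joint depth test and the pixel threshold are crude simplifying assumptions, not DreamWaltz's exact visibility computation.

    import numpy as np

    def project_joints(joints, K, R, t, body_radius=0.12, px_thresh=10.0):
        """joints: (J, 3) world coords; K: (3, 3) intrinsics; R, t: extrinsics."""
        cam = joints @ R.T + t                   # world -> camera frame
        uv = cam @ K.T
        uv = uv[:, :2] / uv[:, 2:3]              # perspective divide
        vis = np.ones(len(joints), dtype=bool)
        for i in range(len(joints)):             # naive O(J^2) occlusion test
            for j in range(len(joints)):
                if i != j and np.linalg.norm(uv[i] - uv[j]) < px_thresh \
                        and cam[j, 2] < cam[i, 2] - body_radius:
                    vis[i] = False               # a nearer joint covers it
        return uv, vis

    joints = np.random.randn(5, 3) + np.array([0.0, 0.0, 3.0])  # in front of cam
    K = np.array([[500.0, 0, 128], [0, 500.0, 128], [0, 0, 1]])
    uv, vis = project_joints(joints, K, np.eye(3), np.zeros(3))
    print(uv.shape, vis)                         # (5, 2) plus a visibility mask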
FrAug: Frequency Domain Augmentation for Time Series Forecasting
Chen, Muxi, Xu, Zhijian, Zeng, Ailing, Xu, Qiang
Data augmentation (DA) has become a de facto solution to expand training data size for deep learning. With the proliferation of deep models for time series analysis, various time series DA techniques are proposed in the literature, e.g., cropping-, warping-, flipping-, and mixup-based methods. However, these augmentation methods mainly apply to time series classification and anomaly detection tasks. In time series forecasting (TSF), we need to model the fine-grained temporal relationship within time series segments to generate accurate forecasting results given data in a look-back window. Existing DA solutions in the time domain would break such a relationship, leading to poor forecasting accuracy. To tackle this problem, this paper proposes FrAug, a set of simple yet effective frequency-domain augmentation techniques that ensure the semantic consistency of augmented data-label pairs in forecasting. We conduct extensive experiments on eight widely-used benchmarks with several state-of-the-art TSF deep models. Our results show that FrAug can boost the forecasting accuracy of TSF models in most cases. Moreover, we show that FrAug enables models trained with 1% of the original training data to achieve similar performance to the ones trained on full training data, which is particularly attractive for cold-start forecasting. Finally, we show that applying test-time training with FrAug greatly improves forecasting accuracy for time series with significant distribution shifts, which often occur in real-life TSF applications. Our code is available at https://anonymous.4open.science/r/Fraug-more-results-1785.
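A minimal sketch of frequency masking, one augmentation in this family: the look-back window and the target horizon are concatenated and augmented jointly, so the data-label pair stays semantically consistent. The masking rate and shapes are illustrative assumptions.

    import torch

    def freq_mask(x, y, mask_rate=0.1):
        """x: (batch, lookback), y: (batch, horizon) -> augmented (x, y)."""
        xy = torch.cat([x, y], dim=-1)           # augment window+label jointly
        spec = torch.fft.rfft(xy, dim=-1)
        keep = torch.rand_like(spec.real) >= mask_rate
        spec = spec * keep                       # zero out random frequency bins
        xy_aug = torch.fft.irfft(spec, n=xy.size(-1), dim=-1)
        return xy_aug[..., :x.size(-1)], xy_aug[..., x.size(-1):]

    x, y = torch.randn(8, 96), torch.randn(8, 48)
    x_aug, y_aug = freq_mask(x, y)
    print(x_aug.shape, y_aug.shape)              # (8, 96) (8, 48)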
Time Series is a Special Sequence: Forecasting with Sample Convolution and Interaction
Liu, Minhao, Zeng, Ailing, Lai, Qiuxia, Xu, Qiang
Time series is a special type of sequence data, a set of observations collected at even intervals of time and ordered chronologically. Existing deep learning techniques use generic sequence models (e.g., recurrent neural network, Transformer model, or temporal convolutional network) for time series analysis, which ignore some of its unique properties. For example, the downsampling of time series data often preserves most of the information in the data, while this is not true for general sequence data such as text sequence and DNA sequence. Motivated by the above, in this paper, we propose a novel neural network architecture and apply it for the time series forecasting problem, wherein we conduct sample convolution and interaction at multiple resolutions for temporal modeling. The proposed architecture, namely SCINet, facilitates extracting features with enhanced predictability. Experimental results show that SCINet achieves significant prediction accuracy improvement over existing solutions across various real-world time series forecasting datasets. In particular, it can achieve high forecasting accuracy for those temporal-spatial datasets without using sophisticated spatial modeling techniques. Our codes and data are presented in the supplemental material.
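A minimal sketch of one split-and-interact step in the spirit of the described sample convolution: the sequence is downsampled into even and odd sub-sequences, which then exchange information through learned convolutions. The module sizes and the gating form are simplified assumptions, not the released SCINet.

    import torch
    import torch.nn as nn

    class SCIBlock(nn.Module):
        def __init__(self, channels, hidden=32, k=3):
            super().__init__()
            def conv():
                return nn.Sequential(
                    nn.Conv1d(channels, hidden, k, padding=k // 2),
                    nn.LeakyReLU(0.01),
                    nn.Conv1d(hidden, channels, k, padding=k // 2),
                    nn.Tanh())
            self.phi, self.psi = conv(), conv()
            self.rho, self.eta = conv(), conv()

        def forward(self, x):                      # x: (batch, channels, time)
            even, odd = x[..., ::2], x[..., 1::2]  # downsample into two halves
            odd_s = odd * torch.exp(self.phi(even))   # scale by the other half
            even_s = even * torch.exp(self.psi(odd))
            odd_out = odd_s + self.rho(even_s)        # exchange information
            even_out = even_s + self.eta(odd_s)
            return even_out, odd_out

    x = torch.randn(8, 7, 96)                    # e.g., 7 variables, 96 steps
    even, odd = SCIBlock(7)(x)
    print(even.shape, odd.shape)                 # (8, 7, 48) twice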