Goto

Collaborating Authors

 Zhou, Zhuoran


PackDiT: Joint Human Motion and Text Generation via Mutual Prompting

arXiv.org Artificial Intelligence

Human motion generation has advanced markedly with the advent of diffusion models. Most recent studies have concentrated on generating motion sequences based on text prompts, commonly referred to as text-to-motion generation. However, the bidirectional generation of motion and text, enabling tasks such as motion-to-text alongside text-to-motion, has been largely unexplored. This capability is essential for aligning diverse modalities and supports unconditional generation. In this paper, we introduce PackDiT, the first diffusion-based generative model capable of performing various tasks simultaneously, including motion generation, motion prediction, text generation, text-to-motion, motion-to-text, and joint motion-text generation. Our core innovation leverages mutual blocks to integrate multiple diffusion transformers (DiTs) across different modalities seamlessly. We train PackDiT on the HumanML3D dataset, achieving state-of-the-art text-to-motion performance with an FID score of 0.106, along with superior results in motion prediction and in-between tasks. Our experiments further demonstrate that diffusion models are effective for motion-to-text generation, achieving performance comparable to that of autoregressive models.


Efficient Domain Adaptation via Generative Prior for 3D Infant Pose Estimation

arXiv.org Artificial Intelligence

Although 3D human pose estimation has gained impressive development in recent years, only a few works focus on infants, that have different bone lengths and also have limited data. Directly applying adult pose estimation models typically achieves low performance in the infant domain and suffers from out-of-distribution issues. Moreover, the limitation of infant pose data collection also heavily constrains the efficiency of learning-based models to lift 2D poses to 3D. To deal with the issues of small datasets, domain adaptation and data augmentation are commonly used techniques. Following this paradigm, we take advantage of an optimization-based method that utilizes generative priors to predict 3D infant keypoints from 2D keypoints without the need of large training data. We further apply a guided diffusion model to domain adapt 3D adult pose to infant pose to supplement small datasets. Besides, we also prove that our method, ZeDO-i, could attain efficient domain adaptation, even if only a small number of data is given. Quantitatively, we claim that our model attains state-of-the-art MPJPE performance of 43.6 mm on the SyRIP dataset and 21.2 mm on the MINI-RGBD dataset.


Back to Optimization: Diffusion-based Zero-Shot 3D Human Pose Estimation

arXiv.org Artificial Intelligence

Learning-based methods have dominated the 3D human pose estimation (HPE) tasks with significantly better performance in most benchmarks than traditional optimization-based methods. Nonetheless, 3D HPE in the wild is still the biggest challenge for learning-based models, whether with 2D-3D lifting, image-to-3D, or diffusion-based methods, since the trained networks implicitly learn camera intrinsic parameters and domain-based 3D human pose distributions and estimate poses by statistical average. On the other hand, the optimization-based methods estimate results case-by-case, which can predict more diverse and sophisticated human poses in the wild. By combining the advantages of optimization-based and learning-based methods, we propose the \textbf{Ze}ro-shot \textbf{D}iffusion-based \textbf{O}ptimization (\textbf{ZeDO}) pipeline for 3D HPE to solve the problem of cross-domain and in-the-wild 3D HPE. Our multi-hypothesis \textit{\textbf{ZeDO}} achieves state-of-the-art (SOTA) performance on Human3.6M, with minMPJPE $51.4$mm, without training with any 2D-3D or image-3D pairs. Moreover, our single-hypothesis \textit{\textbf{ZeDO}} achieves SOTA performance on 3DPW dataset with PA-MPJPE $40.3$mm on cross-dataset evaluation, which even outperforms learning-based methods trained on 3DPW.


CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility

arXiv.org Artificial Intelligence

With the rapid evolution of large language models (LLMs), there is a growing concern that they may pose risks or have negative social impacts. Therefore, evaluation of human values alignment is becoming increasingly important. Previous work mainly focuses on assessing the performance of LLMs on certain knowledge and reasoning abilities, while neglecting the alignment to human values, especially in a Chinese context. In this paper, we present CValues, the first Chinese human values evaluation benchmark to measure the alignment ability of LLMs in terms of both safety and responsibility criteria. As a result, we have manually collected adversarial safety prompts across 10 scenarios and induced responsibility prompts from 8 domains by professional experts. To provide a comprehensive values evaluation of Chinese LLMs, we not only conduct human evaluation for reliable comparison, but also construct multi-choice prompts for automatic evaluation. Our findings suggest that while most Chinese LLMs perform well in terms of safety, there is considerable room for improvement in terms of responsibility. Moreover, both the automatic and human evaluation are important for assessing the human values alignment in different aspects. The benchmark and code is available on ModelScope and Github.