SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation
Wanqi Yin, Zhongang Cai, Ruisi Wang, Ailing Zeng, Chen Wei, Qingping Sun, Haiyi Mei, Yanjun Wang, Hui En Pang, Mingyuan Zhang, Lei Zhang, Chen Change Loy, Atsushi Yamashita, Lei Yang, Ziwei Liu
Abstract--Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Despite encouraging progress, current state-of-the-art methods focus on training innovative architectural designs on confined datasets. In this work, we investigate the impact of scaling up EHPS towards a family of generalist foundation models. More importantly, capitalizing on insights obtained from the extensive benchmarking process, we optimize our training scheme and select datasets that lead to a significant leap in EHPS capabilities. Ultimately, we achieve diminishing returns at 10M training instances from diverse data sources. To exclude the influence of algorithmic design, we base our experiments on two minimalist architectures: SMPLer-X, which consists of an intermediate step for hand and face localization, and SMPLest-X, an even simpler version that reduces the network to its bare essentials and highlights significant advances in the capture of articulated hands. Moreover, our finetuning strategy turns the generalist into specialist models, allowing them to achieve further performance boosts. Notably, our foundation models consistently deliver state-of-the-art results on seven benchmarks such as AGORA, UBody, EgoBody, and our proposed SynHand dataset for comprehensive hand evaluation. We report performance across a basket of key benchmarks in order to provide a holistic measurement of generalization capabilities. Our study underscores the importance of harnessing a multitude of datasets to capitalize on their complementary nature. Moreover, we contribute a new dataset, SynHand, to provide the community with a long-awaited benchmark for comprehensive hand pose evaluation in a whole-body setting. SynHand features diverse hand poses in close-up shots, accurately annotated as part of the whole-body SMPL-X labels.
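The abstract's "whole-body SMPL-X labels" refer to a single parameter set that jointly covers body shape, body pose, both hands, and facial expression. As a rough illustration only (not the paper's code), the sketch below uses the public smplx Python package to assemble such a parameter set and recover a posed mesh; the model directory path and the zero-initialized parameters are placeholder assumptions.

```python
import torch
import smplx  # pip install smplx; SMPL-X model files must be downloaded separately

# Build a SMPL-X layer (paths and settings here are illustrative, not from the paper).
model = smplx.create(
    model_path="models",     # assumed directory containing smplx/SMPLX_NEUTRAL.npz
    model_type="smplx",
    gender="neutral",
    use_pca=False,           # drive each hand with full per-joint axis-angle rotations
    batch_size=1,
)

# A whole-body SMPL-X annotation is essentially this set of parameter tensors.
params = {
    "betas": torch.zeros(1, 10),             # body shape coefficients
    "global_orient": torch.zeros(1, 3),      # root rotation (axis-angle)
    "body_pose": torch.zeros(1, 21 * 3),     # 21 body joints
    "left_hand_pose": torch.zeros(1, 15 * 3),
    "right_hand_pose": torch.zeros(1, 15 * 3),
    "jaw_pose": torch.zeros(1, 3),
    "expression": torch.zeros(1, 10),        # facial expression coefficients
}

output = model(**params)
print(output.vertices.shape)  # (1, 10475, 3) posed whole-body mesh vertices
print(output.joints.shape)    # (1, J, 3) 3D joints covering body, hands, and face
```

An EHPS model such as those discussed above regresses these parameters from an image; evaluation then compares the resulting mesh vertices and joints against the ground-truth labels.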
arXiv.org Artificial Intelligence
Jan-16-2025
- Genre:
- Research Report
- New Finding (0.92)
- Promising Solution (0.68)
- Industry:
- Education > Educational Setting (0.92)
- Health & Medicine (0.66)
- Technology: