3D Pose Estimation and Future Motion Prediction from 2D Images

Yang, Ji, Ma, Youdong, Zuo, Xinxin, Wang, Sen, Gong, Minglun, Cheng, Li

arXiv.org Artificial Intelligence 

In many recent efforts [1, 2, 3, 4], 3D human pose estimation has been decomposed into a two-stage process: first, the 2D keypoints that correspond to the body joints are detected from the 2D image, after which the detected joints are lifted to obtain 3D pose. This type of solution is elegant in terms of the simplicity of problem formulation, unfortunately it suffers from inherent ambiguities caused by projection: different 3D poses can share the same 2D pose projection given a specific viewpoint; that is, the mapping between the 2D joints detection and 3D pose is not bijective. To resolve this ambiguity of 3D pose estimation from a monocular image, video-based pose estimation is also investigated in the literature [5, 6]. Existing video-based pose estimation methods, however, either need to observe a relatively long history (243 frames [5] or can only handle a short video sequence (4-6 frames [6]) to achieve their best results.