Liu, Xiaobai
Discrete Optimal Graph Clustering
Han, Yudong | Zhu, Lei | Cheng, Zhiyong | Li, Jingjing | Liu, Xiaobai
Graph-based clustering is one of the major clustering approaches. Most such methods work in three separate steps: similarity graph construction, cluster label relaxation, and label discretization with k-means. This common practice has three disadvantages: 1) the predefined similarity graph is often fixed and may not be optimal for the subsequent clustering; 2) relaxing the cluster labels may cause significant information loss; 3) label discretization may deviate from the real clustering result, since k-means is sensitive to the initialization of cluster centroids. To tackle these problems, we propose an effective discrete optimal graph clustering (DOGC) framework. A structured similarity graph that is theoretically optimal for clustering performance is adaptively learned under the guidance of a reasonable rank constraint. In addition, to avoid information loss, we explicitly enforce a discrete transformation on the intermediate continuous labels, which yields a tractable optimization problem with a discrete solution. Further, to compensate for the unreliability of the learned labels and enhance clustering accuracy, we design an adaptive robust module that learns a prediction function for unseen data based on the learned discrete cluster labels. Finally, an iterative optimization strategy with guaranteed convergence is developed to directly solve for the clustering results. Extensive experiments conducted on both real and synthetic datasets demonstrate the superiority of our proposed methods over several state-of-the-art clustering approaches.
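As a point of reference, the conventional three-step pipeline that the abstract criticizes can be sketched as follows. This is a minimal illustration of the baseline, not the DOGC method. The Gaussian kernel, the normalized Laplacian, and the names `gaussian_similarity`, `relaxed_labels`, and `kmeans` are assumptions chosen for the sketch and do not come from the paper.

```python
import numpy as np

def gaussian_similarity(X, sigma=1.0):
    # Step 1: a fixed, predefined similarity graph (dense Gaussian kernel).
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W

def relaxed_labels(W, k):
    # Step 2: continuous relaxation of the cluster indicators, taken as the
    # k eigenvectors of the normalized Laplacian with the smallest eigenvalues.
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)  # eigenvalues are returned in ascending order
    F = vecs[:, :k]
    return F / np.linalg.norm(F, axis=1, keepdims=True)

def kmeans(F, k, seed=0, n_iter=50):
    # Step 3: discretize the relaxed labels with plain k-means.
    # The result depends on the random initial centroids chosen via `seed`.
    rng = np.random.default_rng(seed)
    centers = F[rng.choice(len(F), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((F[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = F[labels == j].mean(axis=0)
    return labels

# Two well-separated synthetic blobs of 10 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (10, 2)), rng.normal(5.0, 0.3, (10, 2))])
labels = kmeans(relaxed_labels(gaussian_similarity(X), k=2), k=2)
```

In this sketch the graph `W` is built once and never revisited, the eigenvectors `F` are real-valued rather than discrete, and the final labels depend on the k-means seed. These correspond to the three disadvantages listed in the abstract.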
Learning Pose Grammar to Encode Human Body Configuration for 3D Pose Estimation
Fang, Hao-Shu (Shanghai Jiao Tong University) | Xu, Yuanlu (University of California, Los Angeles) | Wang, Wenguan (Beijing Institute of Technology) | Liu, Xiaobai (San Diego State University) | Zhu, Song-Chun (University of California, Los Angeles)
In this paper, we propose a pose grammar to tackle the problem of 3D human pose estimation. Our model directly takes a 2D pose as input and learns a generalized 2D-3D mapping function. The proposed model consists of a base network, which efficiently captures pose-aligned features, and a hierarchy of Bi-directional RNNs (BRNNs) on top, which explicitly incorporates a set of knowledge regarding human body configuration (i.e., kinematics, symmetry, motor coordination). The proposed model thus enforces high-level constraints over human poses. In learning, we develop a pose sample simulator to augment training samples in virtual camera views, which further improves our model's generalizability. We validate our method on public 3D human pose benchmarks and propose a new evaluation protocol for the cross-view setting to verify the generalization capability of different methods. We empirically observe that most state-of-the-art methods encounter difficulty under such a setting, while our method handles such challenges well.
Cross-View People Tracking by Scene-Centered Spatio-Temporal Parsing
Xu, Yuanlu (University of California, Los Angeles) | Liu, Xiaobai (San Diego State University) | Qin, Lei (Chinese Academy of Sciences) | Zhu, Song-Chun (University of California, Los Angeles)
In this paper, we propose a Spatio-temporal Attributed Parse Graph (ST-APG) to integrate semantic attributes with trajectories for cross-view people tracking. Given videos from multiple cameras with overlapping fields of view (FOV), our goal is to parse the videos and organize the trajectories of all targets into a scene-centered representation. We leverage rich semantic attributes of humans, e.g., facing directions, postures and actions, to enhance cross-view tracklet associations, in addition to the appearance and geometry features frequently used in the literature. In particular, the facing direction of a human in 3D, once detected, often coincides with his/her moving direction or trajectory. Similarly, the actions of humans, once recognized, provide strong cues for distinguishing one subject from the others. The inference is solved by iteratively grouping tracklets with cluster sampling and estimating people's semantic attributes by dynamic programming. In experiments, we validate our method on one public dataset and create another new dataset that records people's daily life in public places, e.g., a food court, an office reception and a plaza, each of which includes 3-4 cameras. We evaluate the proposed method on these challenging videos and achieve promising multi-view tracking results.
Multi-View 3D Human Tracking in Crowded Scenes
Liu, Xiaobai (San Diego State University)
This paper presents a robust multi-view method for tracking people in 3D scenes. Our method distinguishes itself from previous works in two aspects. Firstly, we define a set of binary spatial relationships for individual subjects or pairs of subjects that appear at the same time, e.g., being left or right, being closer to or further from the camera, etc. These binary relationships directly reflect the relative positions of subjects in the 3D scene and thus should be preserved during inference. Secondly, we introduce a unified probabilistic framework to exploit binary spatial constraints for simultaneous 3D localization and cross-view human tracking. We develop a cluster Markov Chain Monte Carlo method to search for the optimal solution. We evaluate our method on both public video benchmarks and a newly built multi-view video dataset. Results with comparisons show that our method achieves state-of-the-art tracking results and meter-level 3D localization on challenging videos.
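To make the idea of binary spatial relationships concrete, the sketch below encodes two pairwise relations (left/right and closer/further) and checks whether a hypothesized 3D placement preserves the relations that were observed. It is an illustrative assumption only: the `Subject` type, the field names, and the `violated` helper are hypothetical and are not taken from the paper, whose probabilistic framework and cluster MCMC search are not reproduced here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Subject:
    name: str
    x: float      # lateral position in the camera frame (larger = further right)
    depth: float  # distance from the camera along its viewing direction

def pairwise_relations(a, b):
    # Binary relations of subject a with respect to subject b.
    return {"left_of": a.x < b.x, "closer_than": a.depth < b.depth}

def violated(observed, hyp_a, hyp_b):
    # Names of observed relations that a hypothesized placement contradicts.
    predicted = pairwise_relations(hyp_a, hyp_b)
    return [name for name, value in observed.items() if predicted[name] != value]

a = Subject("A", x=-1.0, depth=4.0)
b = Subject("B", x=2.0, depth=6.0)
observed = pairwise_relations(a, b)

print(violated(observed, a, b))  # consistent hypothesis: no violations
print(violated(observed, b, a))  # identities swapped: both relations break
```

A tracker could use the number of violated relations as a penalty when scoring candidate placements, which is one simple way to keep such relations preserved during inference.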