 pose estimation



MBW: Multi-view Bootstrapping in the Wild

Neural Information Processing Systems

Labeling articulated objects in unconstrained settings has a wide variety of applications including entertainment, neuroscience, psychology, ethology, and many fields of medicine. Large offline labeled datasets do not exist for all but the most common articulated object categories (e.g., humans). Hand labeling these landmarks within a video sequence is a laborious task. Learned landmark detectors can help, but can be error-prone when trained from only a few examples. Multi-camera systems that train fine-grained detectors have shown significant promise in detecting such errors, allowing for self-supervised solutions that only need a small percentage of the video sequence to be hand-labeled.


H-InDex: Visual Reinforcement Learning with Hand-Informed Representations for Dexterous Manipulation

Neural Information Processing Systems

Human hands possess remarkable dexterity and have long served as a source of inspiration for robotic manipulation. In this work, we propose a human Hand-Informed visual representation learning framework to solve difficult Dexterous manipulation tasks (H-InDex) with reinforcement learning. Our framework consists of three stages: (i) pre-training representations with 3D human hand pose estimation, (ii) offline adapting representations with self-supervised keypoint detection, and (iii) reinforcement learning with exponential moving average BatchNorm. The last two stages only modify 0.36% of the parameters of the pre-trained representation in total, ensuring the knowledge from pre-training is maintained to the full extent. We empirically study 12 challenging dexterous manipulation tasks and find that H-InDex largely surpasses strong baseline methods and the recent visual foundation models for motor control. Code is available at yanjieze.com/H-InDex.
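As a rough illustration of stage (iii), the sketch below shows one common way an exponential moving average can be applied to BatchNorm statistics. The update rule, the `momentum` value, and the function names are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
def ema_update(running, batch_value, momentum=0.01):
    """Blend a running statistic with a new batch statistic (hypothetical rule)."""
    return (1.0 - momentum) * running + momentum * batch_value


def update_batchnorm_stats(running_mean, running_var, batch, momentum=0.01):
    """Update the running mean and variance of one channel from a batch of activations."""
    n = len(batch)
    batch_mean = sum(batch) / n
    batch_var = sum((x - batch_mean) ** 2 for x in batch) / n  # biased variance
    new_mean = ema_update(running_mean, batch_mean, momentum)
    new_var = ema_update(running_var, batch_var, momentum)
    return new_mean, new_var
```

With a small momentum the statistics drift slowly toward the new data distribution, which is one plausible reading of how only a small fraction of the representation changes during reinforcement learning.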


Continuous Heatmap Regression for Pose Estimation via Implicit Neural Representation

Neural Information Processing Systems

Heatmap regression has dominated human pose estimation due to its superior performance and strong generalization. To meet the requirements of traditional explicit neural networks for output form, existing heatmap-based methods discretize the originally continuous heatmap representation into 2D pixel arrays, which leads to performance degradation due to the introduction of quantization errors. This problem is significantly exacerbated as the size of the input image decreases, which makes heatmap-based methods not much better than coordinate regression on low-resolution images. In this paper, we propose a novel neural representation for human pose estimation called NerPE to achieve continuous heatmap regression. Given any position within the image range, NerPE regresses the corresponding confidence scores for body joints according to the surrounding image features, which guarantees continuity in space and confidence during training. Thanks to the decoupling from spatial resolution, NerPE can output the predicted heatmaps at arbitrary resolution during inference without retraining, which easily achieves sub-pixel localization precision. To reduce the computational cost, we design progressive coordinate decoding to cooperate with continuous heatmap regression, in which localization no longer requires the complete generation of high-resolution heatmaps.
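The quantization error mentioned above can be reproduced with a toy example. The sketch below renders a Gaussian heatmap on a pixel grid and decodes it with a plain argmax, which is the discretized baseline rather than NerPE itself. The Gaussian shape, `sigma`, the normalized [0, 1] coordinate convention, and all function names are assumptions for illustration.

```python
import math


def render_heatmap(cx, cy, size, sigma=0.05):
    """Sample a Gaussian centred at (cx, cy) in [0, 1]^2 at the centres of a size x size grid."""
    heat = []
    for row in range(size):
        y = (row + 0.5) / size
        line = []
        for col in range(size):
            x = (col + 0.5) / size
            line.append(math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2)))
        heat.append(line)
    return heat


def decode_argmax(heat):
    """Return the normalized centre of the highest-scoring pixel."""
    size = len(heat)
    _, row, col = max((heat[r][c], r, c) for r in range(size) for c in range(size))
    return (col + 0.5) / size, (row + 0.5) / size


def localization_error(cx, cy, size):
    """Distance between the true joint position and the argmax-decoded position."""
    px, py = decode_argmax(render_heatmap(cx, cy, size))
    return math.hypot(px - cx, py - cy)


if __name__ == "__main__":
    for size in (8, 16, 32, 64):
        print(size, round(localization_error(0.37, 0.62, size), 4))
```

Under these assumptions the decoded position can be off by up to half a pixel diagonal, so the error bound grows as the grid gets coarser, which is consistent with the degradation on low-resolution inputs described in the abstract.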


ChimpACT: A Longitudinal Dataset for Understanding Chimpanzee Behaviors

Neural Information Processing Systems

Understanding the behavior of non-human primates is crucial for improving animal welfare, modeling social behavior, and gaining insights into distinctively human and phylogenetically shared behaviors. However, the lack of datasets on non-human primate behavior hinders in-depth exploration of primate social interactions, posing challenges to research on our closest living relatives. To address these limitations, we present ChimpACT, a comprehensive dataset for quantifying the longitudinal behavior and social relations of chimpanzees within a social group. Spanning from 2015 to 2018, ChimpACT features videos of a group of over 20 chimpanzees residing at the Leipzig Zoo, Germany, with a particular focus on documenting the developmental trajectory of one young male, Azibo.



To_The_Point__Correspondence_driven_self_supervised_3D_reconstruction.pdf

Neural Information Processing Systems

Every image is encoded using an ImageNet pre-trained ResNet18 to a latent feature map $z \in \mathbb{R}^{4 \times 4 \times 256}$. A flattened version of $z$ is processed with one linear layer with output channels equal to $N \times 3$ to get the predictions for points $u$ and visibility $v$. We apply the sigmoid function to the visibility predictions $v$ to enforce a numerical range of [0, 1]. Our models are trained using the Adam optimizer with a learning rate of 1e-4. In detail, scale is sampled from the range [0.7, 1.2], vertical translation is up to 38 pixels, and we also apply 2D rotation of up to 40 degrees. For camera equivariance, the image is simply flipped horizontally and given as input to the network to estimate the pose.
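The prediction head described above can be sketched in a few lines. This is a minimal illustration with random weights, assuming the linear layer emits three values per point (two coordinates for u and one visibility logit for v) and that they are laid out point by point; the layout, the initialization, and the names `make_head` and `predict` are hypothetical and not taken from the paper.

```python
import math
import random

FEATURE_DIM = 4 * 4 * 256  # length of the flattened latent feature map z


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def make_head(num_points, seed=0):
    """Create random weights for one linear layer with num_points * 3 outputs."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(FEATURE_DIM)
    weights = [[rng.gauss(0.0, scale) for _ in range(FEATURE_DIM)]
               for _ in range(num_points * 3)]
    bias = [0.0] * (num_points * 3)
    return weights, bias


def predict(z_flat, head, num_points):
    """Map a flattened feature vector to 2D points u and visibilities v in [0, 1]."""
    weights, bias = head
    out = [sum(w * x for w, x in zip(row, z_flat)) + b
           for row, b in zip(weights, bias)]
    points, visibility = [], []
    for i in range(num_points):
        ux, uy, v_logit = out[3 * i:3 * i + 3]
        points.append((ux, uy))
        visibility.append(sigmoid(v_logit))  # squash visibility into [0, 1]
    return points, visibility
```

In practice the feature vector would come from the ResNet18 encoder; here any list of `FEATURE_DIM` floats can stand in for it.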



HRFormer: High-Resolution Transformer for Dense Prediction

Neural Information Processing Systems

We present a High-Resolution Transformer (HRFormer) that learns high-resolution representations for dense prediction tasks, in contrast to the original Vision Transformer that produces low-resolution representations and has high memory and computational cost. We take advantage of the multi-resolution parallel design introduced in high-resolution convolutional networks (HRNet [46]), along with local-window self-attention that performs self-attention over small non-overlapping image windows [21], for improving the memory and computation efficiency. In addition, we introduce a convolution into the FFN to exchange information across the disconnected image windows. We demonstrate the effectiveness of the High-Resolution Transformer on both human pose estimation and semantic segmentation tasks, e.g., HRFormer outperforms Swin Transformer [27] by 1.3 AP on COCO pose estimation with 50% fewer parameters and 30% fewer FLOPs. Code is available at: https://github.com/HRNet/HRFormer.
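To make the idea of local-window self-attention concrete, the sketch below partitions a feature map into non-overlapping windows and runs attention inside each one. It is a simplified illustration, not the HRFormer implementation: it uses a single head with identity query, key, and value projections, and the function names are hypothetical.

```python
import math


def window_partition(feature_map, window):
    """Split an H x W grid of feature vectors into non-overlapping window x window groups."""
    height, width = len(feature_map), len(feature_map[0])
    assert height % window == 0 and width % window == 0
    windows = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            windows.append([feature_map[r][c]
                            for r in range(top, top + window)
                            for c in range(left, left + window)])
    return windows


def self_attention(tokens):
    """Scaled dot-product attention with identity projections (a simplifying assumption)."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # softmax, shifted for stability
        total = sum(exps)
        outputs.append([sum(e / total * value[d] for e, value in zip(exps, tokens))
                        for d in range(dim)])
    return outputs


def local_window_attention(feature_map, window):
    """Attend only among tokens that share a window; windows never exchange information."""
    return [self_attention(tokens) for tokens in window_partition(feature_map, window)]
```

Because each token attends only to the tokens in its own window, the number of attention pairs grows linearly with image area instead of quadratically. The same property means windows are isolated from each other, which is the gap the convolution in the FFN is described as closing.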