Seeing Beyond the Crop: Using Language Priors for Out-of-Bounding Box Keypoint Prediction
Neural Information Processing Systems
Accurate estimation of human pose, together with the pose of interacting objects such as a hockey stick, is crucial for action recognition and performance analysis, particularly in sports. Existing methods capture the object along with the human in a single bounding box, assuming that all keypoints are visible within it. This requires larger bounding boxes that cover the object, which introduces unnecessary visual features and hinders performance in cluttered real-world environments. We propose TokenCLIPose, a simple multimodal solution based on image and text that addresses this limitation. Our approach focuses solely on the human keypoints within the bounding box and treats objects as unseen. TokenCLIPose leverages the rich semantic representations provided by language to induce keypoint-specific context, even for occluded keypoints. We evaluate TokenCLIPose on a real-world ice hockey dataset and demonstrate its generalizability through zero-shot transfer to a smaller lacrosse dataset.
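To make the "out-of-bounding-box" setting concrete, the following is a minimal sketch of how a keypoint can be expressed relative to a human-only crop. It is an illustration of the problem setup only, not the TokenCLIPose model: the function names, the box convention `(x_min, y_min, x_max, y_max)`, and the pixel coordinates are all assumptions made for this example. A keypoint whose normalized coordinates fall outside the unit square lies beyond the crop, which is the case a stick keypoint would typically be in.

```python
from typing import Tuple

Point = Tuple[float, float]
# Assumed box convention for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def normalize_to_box(point: Point, box: Box) -> Point:
    """Map an image-space point to coordinates relative to the box.

    (0, 0) is the top-left corner of the box and (1, 1) the bottom-right,
    so points outside the box map outside the unit square.
    """
    x, y = point
    x_min, y_min, x_max, y_max = box
    width = x_max - x_min
    height = y_max - y_min
    if width <= 0 or height <= 0:
        raise ValueError("box must have positive width and height")
    return ((x - x_min) / width, (y - y_min) / height)


def is_inside_box(normalized: Point) -> bool:
    """Return True when a normalized point lies within the crop."""
    u, v = normalized
    return 0.0 <= u <= 1.0 and 0.0 <= v <= 1.0


# Hypothetical coordinates: a tight box around the player only.
player_box: Box = (100.0, 50.0, 200.0, 250.0)
wrist = normalize_to_box((150.0, 150.0), player_box)
stick_tip = normalize_to_box((260.0, 270.0), player_box)

print(wrist, is_inside_box(wrist))
print(stick_tip, is_inside_box(stick_tip))
```

In this toy example the wrist lands inside the crop while the stick tip does not, so a predictor that only sees the cropped image has no pixels for the stick tip and must rely on other context to locate it.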
Mar-27-2025, 04:14:05 GMT
- Country:
- North America > United States (0.14)
- Genre:
- Research Report > Experimental Study (0.93)
- Industry:
- Leisure & Entertainment > Sports > Hockey (1.00)
- Technology:
- Information Technology
- Artificial Intelligence
- Machine Learning > Neural Networks
- Deep Learning (0.68)
- Natural Language (1.00)
- Representation & Reasoning (1.00)
- Vision (1.00)
- Graphics (1.00)
- Sensing and Signal Processing > Image Processing (1.00)