Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI Detection
Neural Information Processing Systems
Zero-shot Human-Object Interaction (HOI) detection aims to localize humans and objects in an image and recognize their interactions, even when specific verb-object pairs are unseen during training. Recent works have shown promising results using prompt learning with pretrained vision-language models such as CLIP, which align natural language prompts with visual features in a shared embedding space. However, existing approaches still fail to handle the visual complexity of interactions, including (1) intra-class visual diversity, where instances of the same verb appear in diverse poses and contexts, and (2) inter-class visual entanglement, where distinct verbs yield visually similar patterns.
Jun-13-2026, 20:28:17 GMT