Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI Detection

Jun-13-2026, 20:28:17 GMT–Neural Information Processing Systems

Zero-shot Human-Object Interaction (HOI) detection aims to localize humans and objects in an image and recognize their interactions, even when specific verb-object pairs are unseen during training. Recent works have shown promising results using prompt learning with pretrained vision-language models such as CLIP, which align natural language prompts with visual features in a shared embedding space. However, existing approaches still fail to handle two challenges: (1) cases where instances of the same verb appear in diverse poses and contexts, and (2) cases where distinct verbs yield visually similar patterns.
