Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI Detection
Zero-shot Human-Object Interaction (HOI) detection aims to localize humans and objects in an image and recognize their interactions, even when specific verb-object pairs are unseen during training. Recent works have shown promising results using prompt learning with pretrained vision-language models such as CLIP, which align natural language prompts with visual features in a shared embedding space. However, existing approaches still fail to handle the visual complexity of interaction, including (1) intra-class visual diversity, where instances of the same verb appear in diverse poses and contexts, and (2) inter-class visual entanglement, where distinct verbs yield visually similar patterns. To address these challenges, we propose VDRP, a framework for Visual Diversity and Region-aware Prompt learning. First, we introduce a visual diversity-aware prompt learning strategy that injects group-wise visual variance into the context embedding.
9d411e87d0f37059f40fb27c5de00ba0-Supplemental-Datasets_and_Benchmarks_Track.pdf
The following section answers the questions listed in Datasheets for Datasets.

A.1 Motivation

Question: For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled?
Answer: To evaluate the linguistic robustness of language models across diverse English varieties by transforming Standard American English (SAE) datasets.

Question: Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?
Answer: The authors of this paper.

Question: Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.
49d1cf22327c51331cbd52bcb76a09a6-Supplemental-Conference.pdf
ConceptNet comprises commonly observed entities and their connections, where edge weights signify the reliability and frequency of these relationships. To prevent the redundancy of common information and to maintain the validity of the enriched relations, we categorized the relationships based on their weights. Relationships with weights less than 1 were deemed "weak" and those with a weight of 1 were labeled "average". We refrained from using these categories for relation enhancement. Instead, only relationships with weights greater than 1, indicative of high reliability, were employed for augmenting the relations.
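The weight-based filtering described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Edge` type, the function names, the "strong" label for weights above 1, and the toy edges are all assumptions made for the example, and no real ConceptNet data or API is used.

```python
from typing import List, NamedTuple


class Edge(NamedTuple):
    """Hypothetical container for one ConceptNet-style relationship."""
    head: str
    relation: str
    tail: str
    weight: float


def categorize(weight: float) -> str:
    """Map an edge weight to a category using the thresholds in the text.

    "weak" and "average" are the labels used in the text; "strong" is an
    assumed label for the high-reliability case (weight > 1).
    """
    if weight < 1.0:
        return "weak"
    if weight == 1.0:
        return "average"
    return "strong"


def select_reliable(edges: List[Edge]) -> List[Edge]:
    """Keep only edges with weight > 1; weak and average edges are dropped."""
    return [e for e in edges if categorize(e.weight) == "strong"]


# Toy edges with made-up weights, for illustration only.
edges = [
    Edge("dog", "CapableOf", "bark", 2.0),
    Edge("dog", "RelatedTo", "pet", 1.0),
    Edge("dog", "AtLocation", "park", 0.5),
]

reliable = select_reliable(edges)
for e in reliable:
    print(e.head, e.relation, e.tail, e.weight)
```

Only the first toy edge passes the filter, since the other two fall into the "average" and "weak" categories that the text excludes from relation augmentation.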