Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI Detection

Neural Information Processing Systems

Zero-shot Human-Object Interaction (HOI) detection aims to localize humans and objects in an image and recognize their interactions, even when specific verb-object pairs are unseen during training. Recent works have shown promising results using prompt learning with pretrained vision-language models such as CLIP, which align natural language prompts with visual features in a shared embedding space. However, existing approaches still fail to handle the visual complexity of interaction, including (1) intra-class visual diversity, where instances of the same verb appear in diverse poses and contexts, and (2) inter-class visual entanglement, where distinct verbs yield visually similar patterns. To address these challenges, we propose VDRP, a framework for Visual Diversity and Region-aware Prompt learning. First, we introduce a visual diversity-aware prompt learning strategy that injects group-wise visual variance into the context embedding.


9d411e87d0f37059f40fb27c5de00ba0-Supplemental-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems

The following section answers the questions listed in Datasheets for Datasets.
A.1 Motivation
Question: For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled?
Answer: To evaluate the linguistic robustness of language models across diverse English varieties by transforming Standard American English (SAE) datasets.
Question: Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?
Answer: The authors of this paper.
Question: Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.



49d1cf22327c51331cbd52bcb76a09a6-Supplemental-Conference.pdf

Neural Information Processing Systems

ConceptNet comprises commonly observed entities and their connections, where edge weights signify the reliability and frequency of these relationships. To prevent the redundancy of common information and to maintain the validity of the enriched relations, we categorized the relationships based on their weights. Relationships with weights less than 1 were deemed "weak" and those with a weight of 1 were labeled "average". We refrained from using these categories for relation enhancement. Instead, only relationships with weights greater than 1, indicative of high reliability, were employed for augmenting the relations.
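
The weight-based filtering described in this excerpt could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' code: the `Edge` type, the function names, the "strong" label for weights above 1, and the sample edges are all hypothetical, and only the thresholds (below 1, equal to 1, above 1) come from the text.

```python
from typing import List, NamedTuple


class Edge(NamedTuple):
    """Hypothetical container for one weighted relation (not a ConceptNet API type)."""
    head: str
    relation: str
    tail: str
    weight: float


def categorize(weight: float) -> str:
    """Bucket a weight using the thresholds stated in the excerpt."""
    if weight < 1:
        return "weak"
    if weight == 1:
        return "average"
    # The excerpt gives no name for this bucket; "strong" is our own label.
    return "strong"


def select_for_augmentation(edges: List[Edge]) -> List[Edge]:
    """Keep only edges with weight greater than 1, as the excerpt describes."""
    return [e for e in edges if categorize(e.weight) == "strong"]


# Made-up edges purely for illustration.
EXAMPLE_EDGES = [
    Edge("dog", "IsA", "animal", 2.5),
    Edge("dog", "RelatedTo", "bone", 1.0),
    Edge("dog", "AtLocation", "moon", 0.3),
]

if __name__ == "__main__":
    for edge in select_for_augmentation(EXAMPLE_EDGES):
        print(edge)
```

Keeping the categorization separate from the selection step makes it easy to inspect how many relations fall into each bucket before discarding the "weak" and "average" ones.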




Learning Action and Reasoning-Centric Image Editing from Videos and Simulations

Neural Information Processing Systems

Object, attribute or stylistic changes can be learned from visually static datasets. On the other hand, high-quality data for action and reasoning-centric edits is scarce and has to come from entirely different sources that cover e.g.



HOI Analysis: Integrating and Decomposing Human-Object Interaction

Neural Information Processing Systems

In light of this, we propose an Integration-Decomposition Network (IDN) to implement the above transformations and achieve state-of-the-art performance on widely-used HOI detection benchmarks.


3493894fa4ea036cfc6433c3e2ee63b0-AuthorFeedback.pdf

Neural Information Processing Systems

We clarify this in two aspects. We operate all transformations in parallel and the inference speed is 10.04 Exponential function and hinge loss. A1.