Text-Infused Attention and Foreground-Aware Modeling for Zero-Shot Temporal Action Detection

Neural Information Processing Systems

Zero-Shot Temporal Action Detection (ZSTAD) aims to classify and localize action segments in untrimmed videos for unseen action categories. Most existing ZSTAD methods take a foreground-based approach, which limits the integration of text and visual features because it relies on pre-extracted proposals. In this paper, we introduce a cross-modal ZSTAD baseline with mutual cross-attention, integrating both text and visual information throughout the detection process. This simple approach yields superior performance compared to previous methods. Despite this improvement, we further identify a common-action bias issue: the cross-modal baseline over-focuses on common sub-actions because it lacks the ability to discriminate text-related visual parts. To address this issue, we propose Text-infused attention and Foreground-aware Action Detection (Ti-FAD), which enhances the model's ability to focus on text-related sub-actions and to distinguish relevant action segments from the background.
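
Since the paper's implementation is not shown here, the following is a minimal PyTorch sketch of what the mutual cross-attention fusion described in the abstract could look like: text embeddings attend to video snippets and video snippets attend back to the text tokens, so both modalities are conditioned on each other throughout detection. The module name, dimensions, and the residual/LayerNorm choices are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of mutual (bidirectional) cross-attention between text and
# video features. Names and hyperparameters are hypothetical, not from Ti-FAD.
import torch
import torch.nn as nn


class MutualCrossAttention(nn.Module):
    """Fuses the two modalities in both directions: text queries attend to
    the video timeline, and video snippets attend to the class-name tokens."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # text -> video: text embeddings query the video snippet features
        self.text_to_video = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # video -> text: snippet features query the text embeddings
        self.video_to_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(d_model)
        self.norm_v = nn.LayerNorm(d_model)

    def forward(self, text_feats: torch.Tensor, video_feats: torch.Tensor):
        # text_feats:  (B, num_classes,  d_model), e.g., class-name embeddings
        # video_feats: (B, num_snippets, d_model), e.g., per-snippet features
        t2v, _ = self.text_to_video(text_feats, video_feats, video_feats)
        v2t, _ = self.video_to_text(video_feats, text_feats, text_feats)
        # residual connections preserve each modality's original information
        return self.norm_t(text_feats + t2v), self.norm_v(video_feats + v2t)


if __name__ == "__main__":
    fuse = MutualCrossAttention()
    text = torch.randn(2, 20, 512)    # 20 (unseen) class-name embeddings
    video = torch.randn(2, 128, 512)  # 128 video snippets
    t, v = fuse(text, video)
    print(t.shape, v.shape)  # torch.Size([2, 20, 512]) torch.Size([2, 128, 512])
```

The point of the bidirectional design is that detection is never conditioned on visual evidence alone: the text-conditioned video stream can suppress background, while the video-conditioned text stream sharpens which sub-actions each class description actually refers to.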


