Text-Infused Attention and Foreground-Aware Modeling for Zero-Shot Temporal Action Detection

Neural Information Processing Systems 

Our simple approach results in superior performance compared to previous methods. Despite this improvement, we further identify a common-action bias issue: the cross-modal baseline over-focuses on common sub-actions because it lacks the ability to discriminate text-related visual parts.