Alignment-guided Temporal Attention for Video Action Recognition

Neural Information Processing Systems 

Temporal modeling is crucial for various video learning tasks. Most recent approaches employ either factorized (2D+1D) or joint (3D) spatial-temporal operations to extract temporal contexts from the input frames. While the former is computationally more efficient, the latter often achieves better performance. In this paper, we attribute this to a dilemma between the sufficiency and the efficiency of interactions among various positions in different frames. These interactions affect the extraction of task-relevant information shared among frames.
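To make the factorized-versus-joint distinction concrete, below is a minimal, hedged sketch (not taken from the paper) of the two operation styles on a clip tensor of shape (N, C, T, H, W); the channel counts, kernel sizes, and layer names are illustrative assumptions only.

```python
import torch
import torch.nn as nn

# Illustrative sketch (assumed shapes/channels, not the paper's architecture).

# Joint (3D): a single convolution mixes space and time in one step.
joint_3d = nn.Conv3d(64, 64, kernel_size=(3, 3, 3), padding=(1, 1, 1))

# Factorized (2D+1D): a spatial convolution followed by a temporal convolution.
factorized = nn.Sequential(
    nn.Conv3d(64, 64, kernel_size=(1, 3, 3), padding=(0, 1, 1)),  # 2D spatial
    nn.Conv3d(64, 64, kernel_size=(3, 1, 1), padding=(1, 0, 0)),  # 1D temporal
)

x = torch.randn(2, 64, 8, 56, 56)  # (N, C, T, H, W)
print(joint_3d(x).shape)    # torch.Size([2, 64, 8, 56, 56])
print(factorized(x).shape)  # torch.Size([2, 64, 8, 56, 56])
```

The factorized form restricts each step to interactions along either space or time, which is cheaper but limits cross-frame, cross-position interactions; the joint 3D form allows them all at once at higher cost, which is the trade-off the abstract refers to.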