Shifted Chunk Transformer for Spatio-Temporal Representational Learning

Neural Information Processing Systems 

We use a four-layer clip encoder in our experiments; see Table 2. Pretraining on a large amount of data yields better top-1 accuracy.

We further compare ViLT with convolution variants and one Transformer variant, i.e., LSH attention. ViLT (78.4%, 98.3%) outperforms the convolution variant (73.9%, 94.9%).

Empirically, we also compare the shifted MSA with various attentions, e.g., space attention (conventional spatial self-attention within each frame). From the perspective of the human visual system, the typical duration of persistence of vision is 0.1-0.4 s. The shifted MSA is forced to learn fine-grained motion information.
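To make the shifted-MSA intuition concrete, the following is a minimal pure-Python sketch (not the paper's implementation): queries are taken from frame t while keys and values come from the preceding frame t-1, so the attention weights can only be large where features correspond across frames, which forces the module to encode inter-frame motion. The function names (`attention`, `shifted_attention`) and the wrap-around shift are illustrative assumptions.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(queries, keys, values):
    # scaled dot-product attention; each argument is a list of feature vectors
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        # output token = convex combination of the value vectors
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

def shifted_attention(clip):
    # clip: list of frames, each frame a list of token feature vectors.
    # Queries come from frame t, keys/values from frame t-1 (wrap-around),
    # so the attention map must relate tokens ACROSS frames (motion cue),
    # unlike space attention, which stays within a single frame.
    T = len(clip)
    return [attention(clip[t], clip[t - 1], clip[t - 1]) for t in range(T)]
```

A two-frame clip with two tokens of dimension 2 per frame is enough to see the shift: the output for frame 0 is built entirely from the features of the other frame.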