Shifted Chunk Transformer for Spatio-Temporal Representational Learning
Neural Information Processing Systems
We use a four-layer clip encoder in our experiments (see Table 2). Pretraining on a large amount of data yields better top-1 accuracy. We further compare ViL T with convolution variants and one Transformer variant, i.e., LSH attention. ViL T (78.4%, 98.3%) outperforms the convolution variant (73.9%, 94.9%). Empirically, we also compare the shifted MSA with various attentions, i.e., space attention (conventional). From the perspective of the human visual system, the typical duration of persistence of vision is 0.1-0.4 s; the shifted MSA is forced to learn fine-grained motion information.
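To make the shifted-MSA idea concrete, here is a minimal sketch of temporally shifted self-attention: a fraction of channels is shifted forward and backward along the time axis before plain per-frame self-attention, so attention mixes features from neighbouring frames. This is an illustrative reconstruction, not the paper's exact implementation; the function name, the `shift_frac` parameter, and the identity Q/K/V projections are assumptions made for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shifted_self_attention(x, shift_frac=0.25):
    """Sketch of shifted MSA (hypothetical signature).

    x: array of shape (T, N, D) -- T frames, N tokens per frame,
       D channels per token.
    A fraction of channels is shifted across adjacent frames
    before attention, exposing each frame to its neighbours'
    features and forcing the attention to model motion.
    """
    T, N, D = x.shape
    c = int(D * shift_frac)
    shifted = x.copy()
    # first c channels take features from the previous frame,
    # next c channels take features from the next frame
    shifted[1:, :, :c] = x[:-1, :, :c]
    shifted[:-1, :, c:2 * c] = x[1:, :, c:2 * c]
    # plain scaled dot-product self-attention within each frame
    # (identity Q/K/V projections, single head, for the sketch)
    out = np.empty_like(x)
    for t in range(T):
        attn = softmax(shifted[t] @ shifted[t].T / np.sqrt(D))
        out[t] = attn @ shifted[t]
    return out
```

Because the shift happens before attention, even a purely spatial (per-frame) attention map now attends over temporally mixed channels, which is one simple way to inject fine-grained motion cues without full space-time attention.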