Supplementary material for Space-time Mixing Attention for Video Transformer
Neural Information Processing Systems
Instead, we propose two new forms of aggregation: Temporal Attention aggregation and Summary Token. Is space-time attention all you need for video understanding? More is less: Learning efficient video representations by big-little network and depthwise temporal aggregation. arXiv
Feb-10-2026, 09:59:48 GMT