Relational Self-Attention: What's Missing in Attention for Video Understanding Supplementary Material

Apr-25-2026, 15:32:23 GMT–Neural Information Processing Systems

We use TSN-ResNet [11] as our backbone (see Table 1) and initialize it with ImageNet-pretrained weights [4]. We replace 7 of its spatial convolutional layers with RSA layers: for every two ResNet blocks, from the third block in res2 to the second block in res5, the spatial convolutional layer is replaced with an RSA layer. For the bottlenecks that include RSA layers, we randomly initialize the weights using MSRA initialization [3] and set the gamma parameter of the last batch normalization layer to zero. We resize each frame to 240 × 320 and apply random cropping to 224 × 224, scale jittering, and random horizontal flipping for data augmentation. Note that we do not flip videos whose action labels include the words 'left' or 'right', e.g., 'pulling something from left to right'.
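The layer placement and the label-aware flipping rule described above can be illustrated with a minimal pure-Python sketch. It is not the authors' code. The stage depths (3, 4, 6, 3) assume a ResNet-50-style backbone, which this excerpt does not state; the helper names, the 0.5 flip probability, the whole-word label match, and the choice to share one crop and one flip decision across all frames of a clip are likewise assumptions. Scale jittering is omitted.

```python
import random
import re

# Assumption: ResNet-50-style stage depths for res2..res5 (not stated in the text).
def rsa_block_positions(stage_depths=(3, 4, 6, 3)):
    """List (stage, block) pairs for every second block from res2/3 to res5/2."""
    flat = [(f"res{s + 2}", b + 1)
            for s, depth in enumerate(stage_depths)
            for b in range(depth)]
    start = flat.index(("res2", 3))
    end = flat.index(("res5", 2))
    return flat[start:end + 1:2]

# Assumption: a label is direction-sensitive if it contains 'left' or 'right'
# as a whole word.
_DIRECTION_WORDS = re.compile(r"\b(left|right)\b", re.IGNORECASE)

def is_flip_safe(label):
    """True if horizontally flipping the clip does not change its label."""
    return _DIRECTION_WORDS.search(label) is None

def augment_clip(frames, label, rng, crop=224, flip_prob=0.5):
    """Random crop plus label-aware horizontal flip.

    frames: list of frames, each a list of rows, each row a list of pixels.
    The same crop window and flip decision are applied to every frame.
    """
    height, width = len(frames[0]), len(frames[0][0])
    top = rng.randint(0, height - crop)    # randint is inclusive on both ends
    left = rng.randint(0, width - crop)
    flip = is_flip_safe(label) and rng.random() < flip_prob
    out = []
    for frame in frames:
        rows = [row[left:left + crop] for row in frame[top:top + crop]]
        if flip:
            rows = [row[::-1] for row in rows]
        out.append(rows)
    return out
```

Under the assumed stage depths, `rsa_block_positions()` yields seven positions, which matches the count of replaced layers given in the text. Calling `augment_clip(frames, label, random.Random(0))` on 240 × 320 frames returns 224 × 224 crops and never flips a clip whose label contains 'left' or 'right'.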



