Relational Self-Attention: What's Missing in Attention for Video Understanding Supplementary Material
Neural Information Processing Systems
We use TSN-ResNet [11] as our backbone (see Table 1) and initialize it with ImageNet-pretrained weights [4]. We replace seven of its spatial convolutional layers with RSA layers: in every second ResNet block, from the third block of res2 to the second block of res5, the spatial convolutional layer is replaced with an RSA layer. For the bottlenecks that contain RSA layers, we randomly initialize the weights using MSRA initialization [3] and set the gamma parameter of the last batch normalization layer to zero. We resize each frame to 240 × 320 and, for data augmentation, apply random cropping to 224 × 224, scale jittering, and random horizontal flipping. Note that we do not flip videos whose action labels include the words 'left' or 'right', e.g., 'pulling something from left to right'.
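The block selection and the label-aware flipping rule described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' code. It assumes ResNet-50 stage depths (3, 4, 6, 3), which the text does not state but which reproduces the count of seven replaced layers. The names `rsa_block_positions`, `may_flip`, and `sample_augmentation` are hypothetical, as are the flip probability of 0.5 and the choice to share one crop and flip decision across all frames of a clip. Scale jittering is omitted.

```python
import random
import re

# Assumption: ResNet-50 stage depths; the source only says "TSN-ResNet".
STAGE_DEPTHS = {"res2": 3, "res3": 4, "res4": 6, "res5": 3}


def rsa_block_positions(stage_depths=STAGE_DEPTHS):
    """Return 1-indexed (stage, block) pairs, taking every second block
    from the third block of res2 through the second block of res5."""
    flat = [(stage, block)
            for stage, depth in stage_depths.items()
            for block in range(1, depth + 1)]
    start = flat.index(("res2", 3))
    stop = flat.index(("res5", 2))
    return flat[start:stop + 1:2]


def may_flip(label):
    """Return False when the label contains 'left' or 'right' as a whole word."""
    return re.search(r"\b(left|right)\b", label.lower()) is None


def sample_augmentation(label, rng=random, src_hw=(240, 320), crop_hw=(224, 224)):
    """Draw one crop offset and one flip decision for a clip.

    Assumption: the same parameters are applied to every frame of the clip,
    and eligible clips are flipped with probability 0.5.
    """
    top = rng.randint(0, src_hw[0] - crop_hw[0])    # vertical offset in [0, 16]
    left = rng.randint(0, src_hw[1] - crop_hw[1])   # horizontal offset in [0, 96]
    flip = may_flip(label) and rng.random() < 0.5
    return {"top": top, "left": left, "flip": flip}
```

Under the assumed stage depths, `rsa_block_positions()` yields seven positions, starting at `("res2", 3)` and ending at `("res5", 2)`, which matches the seven replaced layers stated above. A call such as `sample_augmentation("pulling something from left to right")` never sets `flip`, because the label contains a direction word.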