Review -- ResT: An Efficient Transformer for Visual Recognition
To compress memory, the 2D input token sequence is reshaped into a 3D feature map and fed to a depth-wise convolution (Conv) that reduces the height and width by a factor of s. Because this reduction can impair the diversity of the attention heads, Instance Normalization (IN) is applied to the dot-product matrix (after Softmax) to restore it. A simple yet effective spatial attention module called Pixel Attention (PA) is used to encode positions: PA applies a 3×3 depth-wise convolution (with padding 1) to obtain pixel-wise weights, which are then scaled by a sigmoid function σ.
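The mechanisms described above can be sketched in PyTorch as below. This is a minimal illustration, not the paper's reference code: the module names (`PixelAttention`, `EfficientMSA`), the choice of `dim`, `heads`, and the reduction factor `s` are assumptions for the sketch.

```python
import torch
import torch.nn as nn


class PixelAttention(nn.Module):
    """PA: 3x3 depth-wise conv produces pixel-wise weights, gated by sigmoid."""
    def __init__(self, dim):
        super().__init__()
        # groups=dim makes the convolution depth-wise; padding=1 keeps H, W.
        self.pa_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):  # x: (B, C, H, W)
        return x * self.sigmoid(self.pa_conv(x))


class EfficientMSA(nn.Module):
    """Multi-head self-attention with spatial reduction and IN after Softmax."""
    def __init__(self, dim, heads=2, s=2):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        # Depth-wise conv reduces height and width by a factor s before K, V.
        self.sr = nn.Conv2d(dim, dim, kernel_size=s, stride=s, groups=dim)
        self.kv = nn.Linear(dim, dim * 2)
        # Instance Norm over the attention map restores head diversity.
        self.inst_norm = nn.InstanceNorm2d(heads)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):  # x: (B, N, C) with N = H * W
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.heads, C // self.heads).transpose(1, 2)
        # Reshape the 2D token sequence into a 3D map and spatially reduce it.
        x3d = x.transpose(1, 2).reshape(B, C, H, W)
        x_sr = self.sr(x3d).reshape(B, C, -1).transpose(1, 2)  # (B, N/s^2, C)
        kv = self.kv(x_sr).reshape(B, -1, 2, self.heads, C // self.heads)
        k, v = kv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N', C/heads)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = self.inst_norm(attn.softmax(dim=-1))  # IN after Softmax
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

For a 4×4 token map (`H = W = 4`) with `dim = 8` and `s = 2`, the keys and values are computed on only 4 reduced tokens instead of 16, which is where the memory saving comes from.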
Oct-29-2022, 00:10:18 GMT