Review -- ResT: An Efficient Transformer for Visual Recognition


To compress memory, the 2D token matrix is reshaped into a 3D feature map and fed through a depth-wise convolution (Conv) that reduces the height and width dimensions by a factor s. Because this compression can impair the diversity of the attention heads, Instance Normalization (IN) is applied to the dot-product attention matrix (after the Softmax) to restore that diversity. For positional encoding, a simple yet effective spatial attention module called Pixel Attention (PA) is used: PA applies a 3×3 depth-wise convolution (with padding 1) to obtain a pixel-wise weight map, which is then scaled by a sigmoid function σ.
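
The memory-compression path can be made concrete with a short PyTorch sketch. This is a minimal reading of the description above, not the authors' implementation: the module name `EMSA`, the parameter `sr_ratio` (the reduction factor s), and the exact layers around the depth-wise convolution are assumptions.

```python
import torch
import torch.nn as nn

class EMSA(nn.Module):
    """Sketch of memory-compressed multi-head self-attention (assumed layout)."""

    def __init__(self, dim, num_heads=8, sr_ratio=2):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5

        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        # Depth-wise conv shrinking H and W by the factor s (here: sr_ratio).
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio,
                            stride=sr_ratio, groups=dim)
        self.sr_norm = nn.LayerNorm(dim)
        # Instance Norm over the attention map, applied after the Softmax,
        # to restore diversity among heads.
        self.attn_norm = nn.InstanceNorm2d(num_heads)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        B, N, C = x.shape  # N == H * W
        q = self.q(x).reshape(B, N, self.num_heads,
                              self.head_dim).transpose(1, 2)

        # Reshape the 2D token matrix into a 3D feature map, then compress.
        x_ = x.transpose(1, 2).reshape(B, C, H, W)
        x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)  # (B, N/s^2, C)
        x_ = self.sr_norm(x_)

        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads,
                                 self.head_dim).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]

        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_norm(attn)  # IN on the dot-product matrix

        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

With `sr_ratio=2`, the key/value sequence shrinks from H·W to (H/2)·(W/2) tokens, so the attention matrix is 4× smaller than in standard self-attention.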
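
The PA positional encoding is simpler still. A sketch under the same assumptions (the class name and its placement after patch embedding are illustrative):

```python
import torch
import torch.nn as nn

class PixelAttention(nn.Module):
    """Sketch of Pixel Attention (PA) positional encoding.

    A 3x3 depth-wise conv (padding 1) produces a pixel-wise weight map,
    which is squashed by a sigmoid and multiplied onto the input.
    """

    def __init__(self, dim):
        super().__init__()
        self.pa_conv = nn.Conv2d(dim, dim, kernel_size=3,
                                 padding=1, groups=dim)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (B, C, H, W) feature map, e.g. from patch embedding
        return x * self.sigmoid(self.pa_conv(x))
```

Since the weight at each position is computed from the feature map itself, this form of positional encoding does not fix the input resolution in advance, unlike a fixed-length positional table.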
