Mixer

Neural Information Processing Systems 

Groupingthechannelstogether Token-mixingMLPstake S-dimensionalvectorsasinputs.Every such vector contains values of asingle feature acrossS different spatial locations. In other words, token-mixing MLPs operate by looking at onlyone channel at once. Forstochasticdepth,followingtheoriginal paper [3], we linearly increase the probability of dropping a layer from0.0 Models are fine-tuned at resolution 224 unless mentioned otherwise. We follow the setup of [2].