Supplementary Materials for MLP-Mixer: An all-MLP Architecture for Vision

Neural Information Processing Systems 

We did not observe any noticeable improvements.

In other words, token-mixing MLPs operate by looking at only one channel at once.

All layers in Mixer retain the same, isotropic design.

Table 1: Hyperparameter settings used for pre-training Mixer models.

However, these did not lead to consistent improvements, so we dropped them.
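The remark that token-mixing MLPs look at only one channel at once can be illustrated with a small sketch. It is not the authors' implementation: it uses plain Python lists, omits layer normalization and the skip-connection, and the names `token_mix`, `w1`, `w2`, `S`, `C`, and `D_S` are chosen here for illustration only. The input is treated as a table with `S` tokens (rows) and `C` channels (columns), and the same two-layer MLP is applied to each column independently.

```python
import math
import random


def gelu(v):
    # Exact GELU via the Gaussian error function.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def matvec(w, x):
    # Dense matrix-vector product for nested lists.
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def token_mix(x, w1, w2):
    """Apply one shared two-layer MLP to every channel (column) of x.

    x  : S rows (tokens), each holding C channel values
    w1 : D_S x S weight matrix (hypothetical hidden width D_S)
    w2 : S x D_S weight matrix
    """
    num_tokens, num_channels = len(x), len(x[0])
    out = [[0.0] * num_channels for _ in range(num_tokens)]
    for c in range(num_channels):
        # Only the values of channel c, across all tokens, enter the MLP.
        column = [x[s][c] for s in range(num_tokens)]
        hidden = [gelu(h) for h in matvec(w1, column)]
        mixed = matvec(w2, hidden)
        for s in range(num_tokens):
            out[s][c] = mixed[s]
    return out


random.seed(0)
S, C, D_S = 4, 3, 5  # illustrative sizes, not values from the paper
w1 = [[random.gauss(0, 1) for _ in range(S)] for _ in range(D_S)]
w2 = [[random.gauss(0, 1) for _ in range(D_S)] for _ in range(S)]
x = [[random.gauss(0, 1) for _ in range(C)] for _ in range(S)]

y = token_mix(x, w1, w2)

# Perturb channel 1 only and recompute.
x_perturbed = [row[:] for row in x]
for s in range(S):
    x_perturbed[s][1] += 10.0
y_perturbed = token_mix(x_perturbed, w1, w2)

# Channel 0 of the output is untouched; channel 1 changes.
channel0_unchanged = all(y[s][0] == y_perturbed[s][0] for s in range(S))
channel1_changed = any(y[s][1] != y_perturbed[s][1] for s in range(S))
```

Because each output column depends only on the matching input column, perturbing one channel leaves every other channel's output unchanged. In this sketch, information moves between channels only if a separate channel-mixing step is added.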