Columns: flops, acts, top-1 error, stem, kernel size, stride, padding, channels
–Neural Information Processing Systems
Table 3: Stem designs. We compare ViT's standard patchify stem (P) and our convolutional stem (C) to four alternatives (S1-S4) that each include a patchify layer, i.e., a convolution with kernel size (> 1) equal to stride (highlighted in blue). Results use 50-epoch training, a 4GF model size, and optimal lr and wd values for all models. We observe that increasing the pixel size of the patchify layer (S1-S4) systematically degrades both top-1 error and optimizer stability relative to C. EDFs are computed by sampling lr and wd values and training for 50 epochs. The table (right) shows 100-epoch results using the best lr and wd values found at 50 epochs. The minor gap in error in the EDFs and at 100 epochs indicates that these choices are fairly insignificant.
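The caption's definition of a patchify layer (a convolution whose kernel size is greater than 1 and equal to its stride) can be illustrated with a minimal sketch. The `ConvSpec` type, the helper names, and the example layers below are hypothetical and chosen only for illustration; they are not the S1-S4 configurations from the table, and the 16x16 patch size is an assumption based on the common ViT setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvSpec:
    """Hypothetical description of one square convolution in a stem."""
    kernel_size: int
    stride: int
    padding: int = 0


def is_patchify(conv: ConvSpec) -> bool:
    """True if kernel size is > 1 and equal to stride (the caption's definition)."""
    return conv.kernel_size > 1 and conv.kernel_size == conv.stride


def output_size(input_size: int, conv: ConvSpec) -> int:
    """Standard convolution output-size formula, assuming no dilation."""
    return (input_size + 2 * conv.padding - conv.kernel_size) // conv.stride + 1


# Illustrative layers only (assumptions, not values taken from Table 3).
patchify_16 = ConvSpec(kernel_size=16, stride=16)             # non-overlapping 16x16 patches
overlapping_3 = ConvSpec(kernel_size=3, stride=2, padding=1)  # overlapping 3x3 convolution
pointwise = ConvSpec(kernel_size=1, stride=1)                 # 1x1 conv, excluded by "> 1"

if __name__ == "__main__":
    for name, conv in [("patchify_16", patchify_16),
                       ("overlapping_3", overlapping_3),
                       ("pointwise", pointwise)]:
        print(name, is_patchify(conv), output_size(224, conv))
```

Under these assumptions, only the first layer counts as patchify: its receptive fields tile the input without overlap, whereas the 3x3 stride-2 convolution overlaps neighboring windows and the 1x1 convolution is ruled out by the kernel size (> 1) condition.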
May-23-2025, 16:24:04 GMT