Goto

Collaborating Authors

 context gt


Supplementary AViT 3B model

Neural Information Processing Systems

The ViT model we use in this work is based on a standard Vision Transformer [7] model scaled to577 nearly 3 billion parameters, using a patch size of 14, 16 heads, 64 blocks, an MLP dimension of 8192578 and a hidden dimension of 2048. The model is defined and trained in Lingvo [32]; we additionally579 employ GSPMD [41] for training. The model is pre-trained on JFT-3B [35] using training settings580 that optimize for performance on JFT-3B rather than for fine-tuning on ImageNet; notably, we do not581 use the training recipe that helps few-shot transfer performance [44]. BReview tools586 We include screenshots of the reviewing tools we built to analyze model mistakes. Figure 3 shows587 the UI for reviewing model predictions and Figure 4 shows the UI that displays the labeling guide588 and slide bar to browse images for a particular class.