Supplementary Materials

A ViT-3B model details

The ViT model we use in this work is based on a standard Vision Transformer [7] model scaled to