A Transfer and finetuning details

Few-shot evaluation. We use the linear adaptation protocol and evaluation sets from [ 68
For each result shown in Figure 1, we select the best setting using 1% of the training data that was held out for this purpose, and report its accuracy on the 50 000 images in the validation set. Full numeric results are provided in Table 10.

In all cases, we select the best model on a held-out 2% of the training data and report that model's

The best setting uses learning rate 0.00001, layer-wise decay 0.8,

Note that the latter does not require re-training for each setting and hence is cheap. We fix rand-augment to (2, 10), Mixup to 0.2, and training duration to 50 000 steps with batch size 512, without revisiting these choices. The best setting uses learning rate 0.0001, layer-wise decay 0.9, and Polyak 0.99999 for

This complements the results from Figure 1 (Right).
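The layer-wise decay and Polyak values quoted above can be illustrated with a minimal sketch. The text gives only the numbers, so everything else here is an assumption: that the last layer trains at the base learning rate with each earlier layer scaled down by the decay factor, and that "Polyak" denotes an exponential moving average of the weights. The function names, the 12-layer depth, the second momentum value 0.9999, and the toy weights are all hypothetical.

```python
from typing import List


def layerwise_learning_rates(base_lr: float, decay: float, num_layers: int) -> List[float]:
    """Return one learning rate per layer, ordered from first to last layer.

    Assumed convention: the last layer trains at `base_lr`, and each layer
    below it has its rate multiplied by `decay` once more.
    """
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


def polyak_update(average: List[float], params: List[float], momentum: float) -> List[float]:
    """One step of an exponential moving average over the weights (assumed form)."""
    return [momentum * a + (1.0 - momentum) * p for a, p in zip(average, params)]


# Hypothetical 12-layer model with the values quoted in the text.
rates = layerwise_learning_rates(base_lr=1e-4, decay=0.9, num_layers=12)

# Several averages with different momenta can be tracked in a single run,
# so comparing them does not need one training run per value.
momenta = [0.9999, 0.99999]               # 0.9999 is a hypothetical second candidate
params = [1.0, -2.0]                      # toy weights
averages = {m: list(params) for m in momenta}
for _ in range(3):                        # stand-in for training steps
    params = [p - 0.1 for p in params]    # toy update, not a real optimizer
    averages = {m: polyak_update(avg, params, m) for m, avg in averages.items()}
```

Under these assumptions, the first layer of the 12-layer example trains at roughly 0.31 times the base rate (0.9 to the 11th power). Tracking several averages side by side in one run is one plausible reading of why the choice is described as cheap.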