A Transfer and finetuning details

Neural Information Processing Systems 

Few-shot evaluation. We use the linear adaptation protocol and evaluation sets from [68, 70] and report the 10-shot classification accuracy. For every combination of data set and model, we run the 10-shot adaptation three times and report the mean (and, for key results, the standard deviation).

LiT decoder and T5 decoder. To train a multi-task decoder from scratch on top of the frozen representation for classification, captioning, and VQA, we precisely follow the setup and hyperparameters from [2], except for the data mixing strategy, which we set to "concat image-question pairs" ([2, Sec. For all encoders, we use the full feature sequence before pooling (including the class token for the evaluation of CLIP). Throughout, we rely on a B-sized transformer decoder [60] with 12 layers.
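
For illustration, the few-shot protocol described above (10-shot linear adaptation on frozen features, repeated three times, reporting mean and standard deviation) might be sketched as follows. This is a minimal sketch under stated assumptions, not the exact procedure of [68, 70]: the closed-form ridge-regression probe, the regularization strength `l2`, the per-run seeding, and all function names are hypothetical choices made for the example.

```python
import numpy as np

def sample_few_shot(labels, shots, rng):
    """Pick `shots` training indices per class, without replacement."""
    idx = [rng.choice(np.flatnonzero(labels == c), size=shots, replace=False)
           for c in np.unique(labels)]
    return np.concatenate(idx)

def add_bias(feats):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([feats, np.ones((len(feats), 1))])

def fit_linear_probe(feats, labels, num_classes, l2=1e-3):
    """Assumed probe: ridge regression from frozen features to one-hot targets."""
    x = add_bias(feats)
    y = np.eye(num_classes)[labels]
    return np.linalg.solve(x.T @ x + l2 * np.eye(x.shape[1]), x.T @ y)

def accuracy(weights, feats, labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    preds = np.argmax(add_bias(feats) @ weights, axis=1)
    return float(np.mean(preds == labels))

def few_shot_eval(train_feats, train_labels, test_feats, test_labels,
                  shots=10, runs=3, seed=0):
    """Repeat the few-shot adaptation `runs` times; return (mean, std) accuracy.

    Labels are assumed to be integers in [0, num_classes).
    """
    num_classes = int(train_labels.max()) + 1
    accs = []
    for r in range(runs):
        rng = np.random.default_rng(seed + r)  # a fresh support set per run
        idx = sample_few_shot(train_labels, shots, rng)
        w = fit_linear_probe(train_feats[idx], train_labels[idx], num_classes)
        accs.append(accuracy(w, test_feats, test_labels))
    return float(np.mean(accs)), float(np.std(accs))
```

Here `train_feats` and `test_feats` stand for precomputed frozen-encoder features of shape `(num_examples, feature_dim)`; only the linear probe is fit, so each run is cheap and the spread across runs reflects the choice of the 10 support examples per class.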