LOVM: Language-Only Vision Model Selection — Appendix

Table of Contents
A.1 LOVM Benchmark - Datasets
A.2 LOVM Benchmark - Vision-Language Models


We evaluated 35 VLMs on each of the 23 datasets, a total of 805 evaluations. This constituted the bulk of our compute, totaling approximately 4 days on an NVIDIA V100 instance.

A.1 LOVM Benchmark - Datasets

The proposed LOVM benchmark comprises 23 datasets, selected to maximize diversity. Specifically, these datasets vary in their number of classes (2 to 1000), their target tasks, and their domains. The benchmark encompasses a wide range of tasks, such as classification, scene understanding, geo-localization, and object counting, with the goal of making it broadly applicable across many applications. Further, the datasets span diverse domains, including natural, satellite, text, and medical images (see Tab. 3 for a comprehensive account of the datasets and their sources). To ensure maximal compatibility, we opted for tasks that permit the use of the same VLM architecture, precluding any alterations or additional training. This approach necessitated the exclusion of tasks such as segmentation and object detection, which require additional trained modules and would introduce extraneous noise when evaluating VLM performance.
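To make the uniform, training-free protocol concrete, the sketch below (assuming the open_clip and torchvision libraries) scores CLIP-style VLMs zero-shot on a single classification dataset; in the benchmark, the same loop is repeated over all 35 models and 23 datasets (805 evaluations). The model list, prompt template, and dataset choice are illustrative placeholders, not the benchmark's actual evaluation code.

```python
# Hypothetical sketch of one slice of the 35 x 23 evaluation grid: CLIP-style
# VLMs scored zero-shot on a classification dataset, with no extra training.
import torch
import open_clip
from torchvision import datasets

device = "cuda" if torch.cuda.is_available() else "cpu"

# Two example (architecture, pretraining tag) pairs; the benchmark uses 35 VLMs.
vlms = [("ViT-B-32", "laion2b_s34b_b79k"), ("RN50", "openai")]

# One example dataset; the benchmark repeats this loop over all 23 datasets.
test_set = datasets.CIFAR100(root="data", train=False, download=True)
class_names = test_set.classes

results = {}
for arch, pretrained in vlms:
    model, _, preprocess = open_clip.create_model_and_transforms(arch, pretrained=pretrained)
    tokenizer = open_clip.get_tokenizer(arch)
    model = model.to(device).eval()

    # Encode one text prompt per class and L2-normalize the embeddings.
    prompts = tokenizer([f"a photo of a {c}" for c in class_names]).to(device)
    with torch.no_grad():
        text_feats = model.encode_text(prompts)
        text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

    correct = 0
    for image, label in test_set:  # yields (PIL image, class index) pairs
        image = preprocess(image).unsqueeze(0).to(device)
        with torch.no_grad():
            img_feat = model.encode_image(image)
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        # Predict the class whose prompt embedding is most similar to the image.
        correct += int((img_feat @ text_feats.T).argmax(dim=-1).item() == label)

    results[(arch, pretrained, "cifar100")] = correct / len(test_set)
```

Because every task is cast as the same image-to-prompt matching problem, each of the 805 model-dataset pairs is evaluated with this identical procedure and no task-specific modules.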