Supplementary Material - WikiDO: A New Benchmark Evaluating Cross-Modal Retrieval for Vision-Language Models

Neural Information Processing Systems 

Q1 For what purpose was the dataset created? Was there a specific task in mind?
Q2 Who created the dataset (e.g., which team, research group) and on behalf of which
Q3 Who funded the creation of the dataset?

Q1 What do the instances that comprise the dataset represent (e.g., documents, photos,
Are there multiple types of instances (e.g., movies, users, and ratings;
Q2 How many instances are there in total (of each type, if appropriate)?
Is the sample representative of the larger set (e.g., geographic coverage)?