Gamified crowd-sourcing of high-quality data for visual fine-tuning
Shashank Yadav, Rohan Tomar, Garvit Jain, Chirag Ahooja, Shubham Chaudhary, Charles Elkan
There are 10 images in a session, 5 tainted and 5 untainted; players do not know which images are tainted. At the end of the session, they earn 20 points for each tainted image on which they chose "Wrong Answer" and the model had indeed been instructed to answer incorrectly (a minimal scoring sketch appears after this summary).

This paper introduces gamified adversarial prompting (GAP), a framework that crowd-sources high-quality data for visual instruction tuning of large multimodal models. GAP transforms the data-collection process into an engaging game, incentivizing players to provide fine-grained, challenging questions and answers that target gaps in the model's knowledge. Our contributions include (1) an approach for capturing question-answer pairs from humans that directly address weaknesses in a model's knowledge, (2) a method for evaluating and rewarding players that successfully incentivizes them to provide high-quality submissions, and (3) a scalable, gamified platform that succeeded in collecting this data from over 50,000 participants in just a few weeks. Our implementation of GAP has significantly improved the accuracy of a small multimodal model, namely MiniCPM-Llama3-V-2.5-8B. Moreover, we demonstrate that the data generated using MiniCPM-Llama3-V-2.5-8B transfers across models: the same data also improves the performance of QWEN2-VL-2B and QWEN2-VL-7B on multiple benchmarks.

Visual question answering (VQA) has emerged as a crucial paradigm in AI, extending beyond mere visual interpretation to facilitate broader and deeper understanding in models. Studies demonstrate VQA's potential to enhance general knowledge acquisition, transfer learning, and complex reasoning skills. Mahdisoltani et al. (2018) showed that pretraining on complex visual-linguistic tasks significantly improves performance across diverse downstream applications, from text generation to fine-grained classification. The encoding of visual information as language, explored in works like Something-Else (Materzynska et al., 2020; Girdhar & Ramanan, 2019) and more recently by Alayrac et al. (2022), enables models to develop low-level visual skills that support sophisticated reasoning in multimodal contexts.
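To make the session scoring concrete, here is a minimal Python sketch of the rule described above. Only the 10-image session, the 5/5 tainted split, and the 20-point reward come from the text; the `ImageRound` structure, function names, and the simulated player are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
import random

# Hypothetical sketch of GAP session scoring; names are assumptions,
# not the authors' implementation.

POINTS_PER_CORRECT_FLAG = 20  # stated reward per correctly flagged tainted image


@dataclass
class ImageRound:
    tainted: bool                # model was secretly instructed to answer incorrectly
    player_flagged_wrong: bool   # player chose "Wrong Answer" for this image


def score_session(rounds: list[ImageRound]) -> int:
    """Award 20 points for each tainted image the player flagged as wrong."""
    return sum(
        POINTS_PER_CORRECT_FLAG
        for r in rounds
        if r.tainted and r.player_flagged_wrong
    )


def make_session(seed: int = 0) -> list[ImageRound]:
    """Build a 10-image session: 5 tainted, 5 untainted, shuffled so the
    player cannot tell which is which."""
    rng = random.Random(seed)
    rounds = [ImageRound(tainted=True, player_flagged_wrong=False) for _ in range(5)]
    rounds += [ImageRound(tainted=False, player_flagged_wrong=False) for _ in range(5)]
    rng.shuffle(rounds)
    return rounds


if __name__ == "__main__":
    session = make_session()
    # Simulate a player who correctly flags the first three tainted images shown.
    flagged = 0
    for r in session:
        if r.tainted and flagged < 3:
            r.player_flagged_wrong = True
            flagged += 1
    print(score_session(session))  # 60 points: 3 tainted images x 20
```

Note that only correctly flagged tainted images score; flagging an untainted image earns nothing, which is what pushes players toward questions the model genuinely gets wrong.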
arXiv.org Artificial Intelligence
Oct-7-2024
- Country:
  - Asia > Middle East
    - UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
  - Europe > Switzerland
  - North America > United States
    - California (0.14)
    - New York (0.14)
- Genre:
  - Research Report (1.00)
- Industry:
  - Leisure & Entertainment > Games > Computer Games (0.68)
- Technology:
  - Information Technology
    - Artificial Intelligence
      - Cognitive Science (1.00)
      - Machine Learning > Neural Networks (0.46)
      - Natural Language > Large Language Model (0.69)
      - Representation & Reasoning > Uncertainty (0.46)
      - Vision (1.00)
    - Communications > Social Media
      - Crowdsourcing (0.70)