Analyzing Modular Approaches for Visual Question Decomposition

Khandelwal, Apoorv, Pavlick, Ellie, Sun, Chen

Nov-10-2023–arXiv.org Artificial Intelligence

Modular neural networks without additional training have recently been shown to surpass end-to-end neural networks on challenging vision-language tasks. The latest such methods simultaneously introduce LLM-based code generation to build programs and a number of skill-specific, task-oriented modules to execute them. In this paper, we focus on ViperGPT and ask where its additional performance comes from: how much is due to the (state-of-the-art, end-to-end) BLIP-2 model it subsumes, and how much to its additional symbolic components? To do so, we conduct a controlled study comparing end-to-end, modular, and prompting-based methods across several VQA benchmarks. We find that ViperGPT's reported gains over BLIP-2 can be attributed to its selection of task-specific modules; when we run ViperGPT using a more task-agnostic selection of modules, these gains go away. Additionally, ViperGPT retains much of its performance if we make prominent alterations to its selection of modules, e.g., removing BLIP-2 or retaining only BLIP-2. Finally, we compare ViperGPT against a prompting-based decomposition strategy and find that, on some benchmarks, modular approaches benefit significantly from representing subtasks in natural language instead of code.

benchmark, blip-2, vipergpt

Genre:
- Research Report > Experimental Study (0.34)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (0.94)
  - Natural Language > Large Language Model (0.69)
  - Machine Learning > Neural Networks (0.69)
