How to Configure Good In-Context Sequence for Visual Question Answering

Li, Li, Peng, Jiawei, Chen, Huiyi, Gao, Chongyang, Yang, Xu

Dec-3-2023–arXiv.org Artificial Intelligence

Inspired by the success of Large Language Models in dealing with new tasks via In-Context Learning (ICL) in NLP, researchers have also developed Large Vision-Language Models (LVLMs) with ICL capabilities. However, when implementing ICL using these LVLMs, researchers usually resort to the simplest way like random sampling to configure the in-context sequence, thus leading to sub-optimal results. To enhance the ICL performance, in this study, we use Visual Question Answering (VQA) as case study to explore diverse in-context configurations to find the powerful ones. Additionally, through observing the changes of the LVLM outputs by altering the in-context sequence, we gain insights into the inner properties of LVLMs, improving our understanding of them. Specifically, to explore in-context configurations, we design diverse retrieval methods and employ different strategies to manipulate the retrieved demonstrations. Through exhaustive experiments on three VQA datasets: VQAv2, VizWiz, and OK-VQA, we uncover three important inner properties of the applied LVLM and demonstrate which strategies can consistently improve the ICL VQA performance. Our code is provided in: https://github.com/GaryJiajia/OFv2_ICL_VQA.

demonstration, instruction, short answer, (15 more...)

arXiv.org Artificial Intelligence

Dec-3-2023

arXiv.org PDF

Add feedback

Country:
- Europe > Switzerland > Zürich > Zürich (0.14)

Genre:
- Research Report > New Finding (0.87)

Technology:
- Information Technology > Artificial Intelligence > Natural Language
  - Large Language Model (1.00)
  - Question Answering (0.84)