What Makes Multimodal In-Context Learning Work?

Folco Bertini Baldassini, Mustafa Shukor, Matthieu Cord, Laure Soulier, Benjamin Piwowarski

arXiv.org Artificial Intelligence 

Large Language Models have demonstrated remarkable performance across various tasks, exhibiting the capacity to swiftly acquire new skills, for example via In-Context Learning (ICL) from only a few demonstration examples. In this work, we present a comprehensive framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. We consider the best open-source multimodal models (e.g., IDEFICS, OpenFlamingo) and a wide range of multimodal tasks. Our study unveils several noteworthy findings: (1) M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality. (2) When used with an advanced ICL strategy such as RICES, M-ICL performs no better than a simple majority-voting baseline over the context examples. Moreover, we identify several biases and limitations of M-ICL that warrant consideration prior to deployment. Code available at https://gitlab.com/folbaeni/multimodal-icl
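To make finding (2) concrete: RICES (Retrieval-based In-Context Example Selection) picks the demonstrations most similar to the query, and the abstract's claim is that simply taking a majority vote over those retrieved examples' labels matches M-ICL. Below is a minimal sketch of that comparison baseline, assuming precomputed image embeddings (e.g., from a frozen CLIP encoder); the function names and random toy data are illustrative assumptions, not the authors' actual implementation.

```python
from collections import Counter

import numpy as np


def rices_select(query_emb: np.ndarray, demo_embs: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k demonstrations most similar to the query.

    RICES-style selection: rank candidate demonstrations by cosine
    similarity between (precomputed) image embeddings.
    """
    # Normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    d = demo_embs / np.linalg.norm(demo_embs, axis=1, keepdims=True)
    sims = d @ q
    return np.argsort(sims)[::-1][:k]


def majority_vote(labels: list[str], selected: np.ndarray) -> str:
    """Baseline: predict the most frequent label among the retrieved demos,
    bypassing the multimodal model entirely."""
    return Counter(labels[i] for i in selected).most_common(1)[0][0]


# Toy usage with random vectors standing in for real image embeddings.
rng = np.random.default_rng(0)
demo_embs = rng.normal(size=(100, 512))
demo_labels = [f"class_{i % 5}" for i in range(100)]
query_emb = rng.normal(size=512)

top_k = rices_select(query_emb, demo_embs, k=8)
print(majority_vote(demo_labels, top_k))
```

The point of the baseline is that it uses no forward pass through the multimodal model: if its accuracy matches RICES-based M-ICL, the model is effectively echoing the retrieved labels rather than reasoning over the images.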
