Can MLLMs Perform Text-to-Image In-Context Learning?
Yuchen Zeng, Wonjun Kang, Yicong Chen, Hyung Il Koo, Kangwook Lee
The evolution from Large Language Models (LLMs) to Multimodal Large Language Models (MLLMs) has spurred research into extending In-Context Learning (ICL) to its multimodal counterpart. Existing studies have primarily concentrated on image-to-text ICL. However, Text-to-Image ICL (T2I-ICL), with its unique characteristics and potential applications, remains underexplored. To address this gap, we formally define the task of T2I-ICL and present CoBSAT, the first T2I-ICL benchmark dataset, encompassing ten tasks. Using our dataset to benchmark six state-of-the-art MLLMs, we uncover considerable difficulties MLLMs encounter in solving T2I-ICL. We identify the primary challenges as the inherent complexity of multimodality and image generation. To overcome these challenges, we explore strategies such as fine-tuning and Chain-of-Thought prompting, demonstrating notable improvements. Our code and dataset are available at \url{https://github.com/UW-Madison-Lee-Lab/CoBSAT}.
arXiv.org Artificial Intelligence
Feb-2-2024
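To make the task concrete, below is a minimal sketch of how a T2I-ICL prompt might be assembled: interleaved (text, image) demonstrations followed by a text-only query whose answer should be a generated image. The helper `build_t2i_icl_prompt`, the file names, and the "red" attribute are hypothetical illustrations, not the actual CoBSAT format; the real data layout and task names are defined in the linked repository.

```python
# Illustrative sketch of a T2I-ICL prompt (assumed format, not the CoBSAT spec).
# Hypothetical example: the latent attribute shared by the demonstrations is the
# color "red"; given the query text "car", the model should generate a red car.

from dataclasses import dataclass
from typing import List


@dataclass
class Demo:
    text: str        # textual query in a demonstration, e.g. an object name
    image_path: str  # path to the corresponding demonstration image


def build_t2i_icl_prompt(demos: List[Demo], query: str) -> List[dict]:
    """Interleave (text, image) demonstrations, then append the text-only
    query whose expected answer is a *generated image*."""
    prompt = []
    for d in demos:
        prompt.append({"type": "text", "content": d.text})
        prompt.append({"type": "image", "content": d.image_path})
    prompt.append({"type": "text", "content": query})
    return prompt


# Hypothetical demonstrations sharing the latent attribute "red"
demos = [
    Demo("apple", "red_apple.jpg"),
    Demo("hat", "red_hat.jpg"),
]
prompt = build_t2i_icl_prompt(demos, query="car")
# An MLLM solving T2I-ICL must infer the shared latent attribute from the
# demonstrations and output an image of a red car for the query.
```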