Large Language Models can Share Images, Too!

Lee, Young-Jun, Hyeon, Jonghwan, Choi, Ho-Jin

arXiv.org Artificial Intelligence 

This paper explores the image-sharing capability of Large Language Models (LLMs), such as InstructGPT, ChatGPT, and GPT-4, in a zero-shot setting, without the help of visual foundation models. Inspired by the two-stage process of image-sharing in human dialogues, we propose a two-stage framework that allows LLMs to predict potential image-sharing turns and generate related image descriptions using our effective restriction-based prompt template. With extensive experiments, we unlock the image-sharing capability of LLMs in zero-shot prompting, with GPT-4 achieving the best performance. Additionally, we uncover the emergent image-sharing ability in zero-shot prompting, demonstrating the effectiveness of restriction-based prompts in both stages of our framework. Based on this framework, we augment the PhotoChat dataset with images generated by Stable Diffusion at the predicted turns, namely PhotoChat++. To our knowledge, this is the first study to assess the image-sharing ability of LLMs in a zero-shot setting without visual foundation models. The source code and the dataset will be released after publication.

People often share a variety of images during interactions via instant messaging tools. In practice theory, this is referred to as photo-sharing behavior (Lobinger, 2016), which is interpreted as a communicative practice. From now on, we refer to this as image-sharing behavior, given that "image" is a broader concept than "photo," thereby providing more flexibility to language models. This behavior involves two or more individuals sharing images for various purposes, such as discussion or self-expression, during a dialogue. For example, while conversing about pets with a friend, one might share an image of their pet (e.g., a dog) to talk about the image itself. Hence, the capability to share images is also necessary for a multi-modal dialogue model to enhance social bonding (rapport) with interlocutors.

However, in the multi-modal dialogue domain, most previous studies have primarily focused on image-grounded dialogues, in which two people talk about given images (Antol et al., 2015; Das et al., 2017; Shuster et al., 2020), a situation that usually arises after an image has already been shared. In contrast to these prior studies, we believe that large language models, which lack visual understanding, can share relevant images to some degree without any help from visual foundation models. Prompt engineering has unlocked the potential of language models for various unseen tasks by skillfully manipulating input prompts with instructions.
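To make the described two-stage pipeline concrete, the following is a minimal sketch of how it could be wired together: stage 1 asks an LLM whether the next turn is an image-sharing turn, stage 2 asks it to generate an image description, and Stable Diffusion renders the image, as in the PhotoChat++ augmentation. The prompt wording, the use of the OpenAI chat API, the model names, and the helper functions are illustrative assumptions, not the paper's actual restriction-based templates or implementation.

```python
# Illustrative sketch only; prompts and model choices are assumptions,
# not the authors' exact restriction-based templates.
from diffusers import StableDiffusionPipeline
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask_llm(prompt: str) -> str:
    """Send a zero-shot prompt to an instruction-tuned LLM (e.g., GPT-4)."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip()


def predict_image_sharing_turn(dialogue: str) -> bool:
    """Stage 1: decide whether the next turn is an image-sharing turn.
    The 'restriction' constrains the answer format to a single word."""
    prompt = (
        "Given the dialogue below, decide whether the next turn is a good "
        "moment to share an image.\n"
        "Restriction: answer with exactly one word, 'Yes' or 'No'.\n\n"
        f"Dialogue:\n{dialogue}\n\nAnswer:"
    )
    return ask_llm(prompt).lower().startswith("yes")


def generate_image_description(dialogue: str) -> str:
    """Stage 2: generate a description of the image to share at that turn."""
    prompt = (
        "Given the dialogue below, write a description of an image that the "
        "speaker would share in the next turn.\n"
        "Restriction: output only the image description, in one sentence.\n\n"
        f"Dialogue:\n{dialogue}\n\nImage description:"
    )
    return ask_llm(prompt)


def render_image(description: str):
    """Render the predicted image description with Stable Diffusion."""
    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    return pipe(description).images[0]


if __name__ == "__main__":
    dialogue = "A: How was your weekend?\nB: Great, I finally adopted a puppy!"
    if predict_image_sharing_turn(dialogue):
        description = generate_image_description(dialogue)
        render_image(description).save("shared_image.png")
```

In this sketch, the restriction lines simply pin down the output format so the model's reply can be parsed deterministically; the paper's templates may impose different or additional constraints.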
