Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
Huang, Yupan; Meng, Zaiqiao; Liu, Fangyu; Su, Yixuan; Collier, Nigel; Lu, Yutong
arXiv.org Artificial Intelligence
Large language models exhibit enhanced zero-shot performance on various tasks when fine-tuned with instruction-following data. Multimodal instruction-following models extend these capabilities by integrating both text and images. However, existing models such as MiniGPT-4 face challenges in maintaining dialogue coherence in scenarios involving multiple images. A primary reason is the lack of a specialized dataset for this critical application. To bridge these gaps, we present SparklesChat, a multimodal instruction-following model for open-ended dialogues across multiple images. To support the training, we introduce SparklesDialogue, the first machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions. Furthermore, we construct SparklesEval, a GPT-assisted benchmark for quantitatively assessing a model's conversational competence across multiple images and dialogue turns. Our experiments validate the effectiveness of SparklesChat in understanding and reasoning across multiple images and dialogue turns. Specifically, SparklesChat outperformed MiniGPT-4 on established vision-and-language benchmarks, including the BISON binary image selection task and the NLVR2 visual reasoning task. Moreover, SparklesChat scored 8.56 out of 10 on SparklesEval, substantially exceeding MiniGPT-4's score of 3.91 and nearing GPT-4's score of 9.26. Qualitative evaluations further demonstrate SparklesChat's generality in handling real-world applications. All resources are available at https://github.com/HYPJUDY/Sparkles.
Oct-1-2023
- Country:
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > France > Île-de-France
- Asia > South Korea > Gyeonggi-do > Suwon (0.04)
- Genre:
- Research Report (0.50)
- Instructional Material (0.45)
- Industry:
- Leisure & Entertainment > Sports (0.68)
- Media (0.67)
- Transportation > Ground (0.46)
- Technology: