MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens
Zheng, Kaizhi, He, Xuehai, Wang, Xin Eric
arXiv.org Artificial Intelligence
Large Language Models (LLMs) have garnered significant attention for their advancements in natural language processing, demonstrating unparalleled prowess in text comprehension and generation. Yet, the simultaneous generation of images with coherent textual narratives remains an evolving frontier. In response, we introduce an innovative interleaved vision-and-language generation technique anchored by the concept of "generative vokens", which act as the bridge for harmonized image-text outputs. Our approach is characterized by a distinctive two-stage training strategy focusing on description-free multimodal generation, where the training requires no comprehensive descriptions of images. To bolster model integrity, classifier-free guidance is incorporated, enhancing the effectiveness of vokens on image generation. Our model, MiniGPT-5, exhibits substantial improvement over the baseline Divter model on the MMDialog dataset and consistently delivers superior or comparable multimodal outputs in human evaluations on the VIST dataset, highlighting its efficacy across diverse benchmarks.

In the recent development of larger-scale vision-and-language models, multimodal feature integration is not just an evolving trend but a critical advancement shaping a wide array of applications, from multimodal dialogue agents to cutting-edge content creation tools. With the surge in research and development in this domain, vision-and-language models (Wu et al., 2023a; Li et al., 2023b; Tsimpoukelli et al., 2021; Alayrac et al., 2022) are on the brink of an era where they are expected to comprehend and generate both text and image content seamlessly. This multi-faceted ability is crucial, as it fosters enhanced interactions across various domains like virtual reality, media, and e-commerce. Essentially, the task is to enable models to coherently synthesize, recognize, and respond using both visual and textual modalities, harmonizing the information flow and creating cohesive narratives.
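The abstract mentions classifier-free guidance without spelling out how it is applied. As a rough illustration only, the sketch below shows the standard classifier-free guidance rule in its generic form: a conditional and an unconditional prediction are blended with a guidance scale. The function name, the toy numbers, and the reading of "conditional" as "conditioned on voken features" are assumptions made for this example, not details taken from the paper.

```python
from typing import List


def classifier_free_guidance(
    cond_pred: List[float],
    uncond_pred: List[float],
    guidance_scale: float,
) -> List[float]:
    """Blend conditional and unconditional predictions.

    Generic rule: guided = uncond + scale * (cond - uncond).
    A scale of 1.0 returns the conditional prediction unchanged;
    larger scales push the output further toward the condition.
    """
    if len(cond_pred) != len(uncond_pred):
        raise ValueError("predictions must have the same length")
    return [
        u + guidance_scale * (c - u)
        for c, u in zip(cond_pred, uncond_pred)
    ]


# Toy values standing in for a denoiser's outputs (hypothetical numbers).
cond = [1.0, 2.0, 3.0]    # assumed: prediction conditioned on voken features
uncond = [0.5, 1.0, 1.5]  # assumed: prediction with the conditioning dropped
guided = classifier_free_guidance(cond, uncond, guidance_scale=2.0)
print(guided)
```

In this form, a guidance scale above 1.0 amplifies the difference the conditioning makes, which is one plausible reading of how guidance could strengthen the effect of vokens on image generation.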
Oct-5-2023
- Country:
- North America > United States (0.67)
- Genre:
- Research Report
- New Finding (0.67)
- Promising Solution (0.46)
- Technology: