MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens

Kaizhi Zheng, Xuehai He, Xin Eric Wang

Oct-5-2023–arXiv.org Artificial Intelligence

Large Language Models (LLMs) have garnered significant attention for their advancements in natural language processing, demonstrating unparalleled prowess in text comprehension and generation. Yet the simultaneous generation of images with coherent textual narratives remains an evolving frontier. In response, we introduce an innovative interleaved vision-and-language generation technique anchored by the concept of "generative vokens", which act as the bridge for harmonized image-text outputs. Our approach is characterized by a distinctive two-stage training strategy focused on description-free multimodal generation, where training requires no comprehensive descriptions of images. To bolster model integrity, classifier-free guidance is incorporated, enhancing the effectiveness of vokens for image generation. Our model, MiniGPT-5, exhibits substantial improvement over the baseline Divter model on the MMDialog dataset and consistently delivers superior or comparable multimodal outputs in human evaluations on the VIST dataset, highlighting its efficacy across diverse benchmarks.

In the recent development of larger-scale vision-and-language models, multimodal feature integration is not just an evolving trend but a critical advancement shaping a wide array of applications, from multimodal dialogue agents to cutting-edge content creation tools. With the surge in research and development in this domain, vision-and-language models (Wu et al., 2023a; Li et al., 2023b; Tsimpoukelli et al., 2021; Alayrac et al., 2022) are on the brink of an era where they are expected to comprehend and generate both text and image content seamlessly. This multi-faceted ability is crucial, as it fosters enhanced interactions across various domains like virtual reality, media, and e-commerce. Essentially, the task is to enable models to coherently synthesize, recognize, and respond using both visual and textual modalities, harmonizing the information flow and creating cohesive narratives.
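The abstract names classifier-free guidance without giving its form. As a minimal sketch, the standard formulation combines a conditional and an unconditional noise prediction from a diffusion model using a guidance scale. How MiniGPT-5 derives the conditional prediction from voken features is not described in this excerpt, so the function name, argument names, and list-based inputs below are hypothetical and chosen only for illustration.

```python
from typing import List


def classifier_free_guidance(
    eps_cond: List[float],
    eps_uncond: List[float],
    guidance_scale: float,
) -> List[float]:
    """Blend noise predictions using the standard classifier-free guidance rule.

    eps_cond:   noise predicted with the conditioning signal (assumed here to
                stand in for voken-derived features; this is an assumption).
    eps_uncond: noise predicted with the conditioning signal dropped.
    guidance_scale: 0 returns the unconditional prediction, 1 returns the
                conditional one, larger values push further toward the condition.
    """
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("predictions must have the same length")
    # eps = eps_uncond + s * (eps_cond - eps_uncond), applied element-wise
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Toy values only; real predictions would be tensors from a denoising network.
guided = classifier_free_guidance([1.0, 2.0], [0.0, 1.0], guidance_scale=2.0)
```

A scale above 1 extrapolates past the conditional prediction, which is the usual way to strengthen the influence of the conditioning signal at some cost in sample diversity.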

dataset, minigpt-5, voken

Country:
- North America
  - United States
    - Illinois > Cook County
      - Chicago (0.04)
    - Florida > Hillsborough County
      - Tampa (0.04)
    - California > Santa Cruz County
      - Santa Cruz (0.04)
  - Canada > Newfoundland and Labrador
    - Newfoundland (0.04)
    - Labrador (0.04)
- Europe
  - United Kingdom > Scotland (0.04)
  - Romania > Sud - Muntenia Development Region
    - Giurgiu County > Giurgiu (0.04)

Genre:
- Research Report
  - New Finding (0.67)
  - Promising Solution (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
