Jointly Training Large Autoregressive Multimodal Models

Aiello, Emanuele; Yu, Lili; Nie, Yixin; Aghajanyan, Armen; Oguz, Barlas

Sep-28-2023–arXiv.org Artificial Intelligence

In recent years, advances in the large-scale pretraining of language and text-to-image models have revolutionized the field of machine learning. Yet, integrating these two modalities into a single, robust model capable of generating seamless multimodal outputs remains a significant challenge. To address this gap, we present the Joint Autoregressive Mixture (JAM) framework, a modular approach that systematically fuses existing text and image generation models. We also introduce a specialized, data-efficient instruction-tuning strategy tailored for mixed-modal generation tasks. Our final instruct-tuned model demonstrates unparalleled performance in generating high-quality multimodal outputs and represents the first model explicitly designed for this purpose.

Autoregressive text-to-image models, as exemplified by works such as Yu et al. (2023; 2022), have made remarkable strides in generating highly detailed images, paralleling the achievements of Diffusion Models Nichol et al. (2022); ...

Country:
- Europe > Switzerland
  - Zürich > Zürich (0.14)
- North America > United States
  - Hawaii (0.15)

Genre:
- Research Report > New Finding (0.92)

Industry:
- Consumer Products & Services (0.94)
- Education (1.00)
- Health & Medicine
  - Consumer Health (1.00)
  - Therapeutic Area (0.92)
- Leisure & Entertainment (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks (0.48)
  - Natural Language > Large Language Model (0.49)
  - Representation & Reasoning > Commonsense Reasoning (0.46)
  - Vision (1.00)
