Chameleon: Mixed-Modal Early-Fusion Foundation Models

May 16, 2024 – arXiv.org Artificial Intelligence

We present Chameleon, a family of early-fusion, token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation. Chameleon demonstrates broad and general capabilities: it achieves state-of-the-art performance on image captioning tasks, outperforms Llama-2 on text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini Pro, and performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or the outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in the unified modeling of full multimodal documents.


Country:
- North America > United States
  - Virginia > Fairfax County > Fairfax (0.04)
- Europe > Germany
  - Berlin (0.04)
- Asia
  - China (0.04)
  - Armenia > Shirak
    - Gyumri (0.04)

Genre:
- Research Report (1.00)

Industry:
- Information Technology (0.46)
- Health & Medicine (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
