Multimodal Foundation Models: From Specialists to General-Purpose Assistants
Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose assistants. The research landscape encompasses five core topics, categorized into two classes. (i) We start with a survey of well-established research areas: multimodal foundation models pre-trained for specific purposes, covering two topics: methods for learning vision backbones for visual understanding, and text-to-image generation. (ii) We then present recent advances in exploratory, open research areas: multimodal foundation models that aim to play the role of general-purpose assistants, covering three topics: unified vision models inspired by large language models (LLMs), end-to-end training of multimodal LLMs, and chaining multimodal tools with LLMs. The target audience of the paper is researchers, graduate students, and professionals in the computer vision and vision-language multimodal communities who are eager to learn the basics and recent advances of multimodal foundation models.
arXiv.org Artificial Intelligence
Sep-18-2023
- Genre:
  - Instructional Material (1.00)
  - Overview (1.00)
  - Research Report > New Finding (1.00)
- Industry:
  - Automobiles & Trucks > Manufacturer (0.67)
  - Education > Educational Setting (0.47)
  - Health & Medicine > Diagnostic Medicine > Imaging (0.45)
  - Information Technology > Security & Privacy (0.45)
  - Leisure & Entertainment > Sports (0.45)
  - Transportation
- Technology: