HEMM: Holistic Evaluation of Multimodal Foundation Models

Liang, Paul Pu, Goindani, Akshay, Chafekar, Talha, Mathur, Leena, Yu, Haofei, Salakhutdinov, Ruslan, Morency, Louis-Philippe

Jul-3-2024, arXiv.org Artificial Intelligence

Multimodal foundation models that can holistically process text alongside images, video, audio, and other sensory modalities are increasingly used in a variety of real-world applications. However, it is challenging to characterize and study progress in multimodal foundation models, given the range of possible modeling decisions, tasks, and domains. In this paper, we introduce Holistic Evaluation of Multimodal Models (HEMM) to systematically evaluate the capabilities of multimodal foundation models across three dimensions: basic skills, information flow, and real-world use cases. Basic multimodal skills are internal abilities required to solve problems, such as learning interactions across modalities, fine-grained alignment, multi-step reasoning, and the ability to handle external knowledge. Information flow studies how multimodal content changes during a task through querying, translation, editing, and fusion. Use cases span domain-specific challenges introduced in real-world multimedia, affective computing, natural sciences, healthcare, and human-computer interaction applications. Through comprehensive experiments across the 30 tasks in HEMM, we (1) identify key dataset dimensions (e.g., basic skills, information flows, and use cases) that pose challenges to today's models, and (2) distill performance trends regarding how different modeling dimensions (e.g., scale, pre-training data, multimodal alignment, and pre-training and instruction tuning objectives) influence performance. Our conclusions yield actionable insights for future work in multimodal foundation models; they concern challenging multimodal interactions, use cases, and tasks requiring reasoning and external knowledge, the benefits of data and model scale, and the impacts of instruction tuning.




Country:
- North America > United States
  - New York (0.04)
  - California > Santa Clara County
    - Palo Alto (0.04)
- Europe
  - Switzerland > Zürich
    - Zürich (0.14)
  - Spain > Catalonia
    - Barcelona Province > Barcelona (0.04)
  - Portugal > Lisbon
    - Lisbon (0.04)
  - Italy > Tuscany
    - Florence (0.04)
- Asia
  - South Korea > Daegu
    - Daegu (0.04)
  - Myanmar > Tanintharyi Region
    - Dawei (0.04)

Genre:
- Research Report
  - New Finding (1.00)
  - Experimental Study (1.00)

Industry:
- Media > Film (1.00)
- Information Technology (1.00)
- Leisure & Entertainment > Sports (0.67)
- Health & Medicine
  - Diagnostic Medicine > Imaging (0.93)
  - Therapeutic Area (0.92)

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (1.00)
  - Human Computer Interaction > Interfaces (1.00)
  - Communications > Social Media (1.00)
  - Artificial Intelligence
    - Vision (1.00)
    - Representation & Reasoning (1.00)
    - Natural Language
      - Large Language Model (1.00)
      - Chatbot (0.93)
    - Machine Learning > Neural Networks
      - Deep Learning > Generative AI (0.46)
