HEMM: Holistic Evaluation of Multimodal Foundation Models
Neural Information Processing Systems
Multimodal foundation models that can holistically process text alongside images, video, audio, and other sensory modalities are increasingly used in a variety of real-world applications. However, it is challenging to characterize and study progress in multimodal foundation models, given the range of possible modeling decisions, tasks, and domains. In this paper, we introduce Holistic Evaluation of Multimodal Models (HEMM) to systematically evaluate the capabilities of multimodal foundation models across three dimensions: basic skills, information flow, and real-world use cases. Basic multimodal skills are internal abilities required to solve problems, such as learning interactions across modalities, fine-grained alignment, multi-step reasoning, and the ability to handle external knowledge.
Mar-20-2025, 10:10:28 GMT