MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures

Ni, Jinjie, Song, Yifan, Ghosal, Deepanway, Li, Bo, Zhang, David Junhao, Yue, Xiang, Xue, Fuzhao, Zheng, Zian, Zhang, Kaichen, Shah, Mahir, Jain, Kabir, You, Yang, Shieh, Michael

Oct-18-2024 – arXiv.org Artificial Intelligence

Perceiving and generating diverse modalities are crucial for AI models to effectively learn from and engage with real-world signals, necessitating reliable evaluations for their development. We identify two major issues in current evaluations: (1) inconsistent standards, shaped by different communities with varying protocols and maturity levels; and (2) significant query, grading, and generalization biases. To address these, we introduce MixEval-X, the first any-to-any, real-world benchmark designed to optimize and standardize evaluations across diverse input and output modalities. We propose multi-modal benchmark mixture and adaptation-rectification pipelines to reconstruct real-world task distributions, ensuring evaluations generalize effectively to real-world use cases. Extensive meta-evaluations show our approach effectively aligns benchmark samples with real-world task distributions. Meanwhile, MixEval-X's model rankings correlate strongly with those of crowd-sourced real-world evaluations (up to 0.98) while being much more efficient. We provide comprehensive leaderboards to rerank existing models and organizations and offer insights to enhance understanding of multi-modal evaluations and inform future research.
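To make the ranking-correlation claim concrete, the following is a minimal sketch of how agreement between a benchmark's model ranking and a crowd-sourced ranking could be measured. The abstract does not say which coefficient is used; Spearman's rank correlation is assumed here. The function names and all scores are hypothetical placeholders, not results from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-model scores (illustrative only, not taken from the paper).
benchmark_scores = [71.2, 65.0, 58.4, 80.1, 49.9]
crowd_ratings = [1180, 1150, 1165, 1250, 1020]

rho = spearman(benchmark_scores, crowd_ratings)
print(round(rho, 2))  # prints 0.9
```

A value near 1 means the two evaluations order the models almost identically; in this toy data only two adjacent models swap places, which is why the coefficient falls slightly below 1.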

large language model, machine learning, natural language

Country:
- Europe > Russia (0.04)
- South America > Chile
  - Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States
  - Louisiana > Orleans Parish > New Orleans (0.04)
- Asia
  - Russia (0.04)
  - Singapore (0.04)

Genre:
- Research Report (0.81)

Industry:
- Education (0.92)
- Media (0.92)
- Leisure & Entertainment > Sports (0.67)
- Information Technology (0.67)
- Health & Medicine (0.67)

Technology:
- Information Technology
  - Communications > Social Media
    - Crowdsourcing (0.87)
  - Artificial Intelligence
    - Vision (1.00)
    - Representation & Reasoning (1.00)
    - Natural Language
      - Large Language Model (1.00)
      - Chatbot (1.00)
    - Machine Learning > Neural Networks
      - Deep Learning (1.00)