UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark

Zhou, Zhaokun, Wang, Qiulin, Lin, Bin, Su, Yiwei, Chen, Rui, Tao, Xin, Zheng, Amin, Yuan, Li, Wan, Pengfei, Zhang, Di

Apr-15-2024–arXiv.org Artificial Intelligence

As an alternative to expensive expert evaluation, Image Aesthetic Assessment (IAA) stands out as a crucial task in computer vision. However, traditional IAA methods are typically constrained to a single data source or task, restricting the universality and broader application. In this work, to better align with human aesthetics, we propose a Unified Multi-modal Image Aesthetic Assessment (UNIAA) framework, including a Multi-modal Large Language Model (MLLM) named UNIAA-LLaVA and a comprehensive benchmark named UNIAA-Bench. We choose MLLMs with both visual perception and language ability for IAA and establish a low-cost paradigm for transforming the existing datasets into unified and high-quality visual instruction tuning data, from which the UNIAA-LLaVA is trained. To further evaluate the IAA capability of MLLMs, we construct the UNIAA-Bench, which consists of three aesthetic levels: Perception, Description, and Assessment. Extensive experiments validate the effectiveness and rationality of UNIAA. UNIAA-LLaVA achieves competitive performance on all levels of UNIAA-Bench, compared with existing MLLMs. Specifically, our model performs better than GPT-4V in aesthetic perception and even approaches the junior-level human. We find MLLMs have great potential in IAA, yet there remains plenty of room for further improvement. The UNIAA-LLaVA and UNIAA-Bench will be released.

composition, dataset, mllm, (14 more...)

arXiv.org Artificial Intelligence

Apr-15-2024

arXiv.org PDF

Add feedback

Country:
- North America
  - United States > Washington
    - King County > Seattle (0.04)
  - Canada > Newfoundland and Labrador
    - Labrador (0.04)
- Asia > China
  - Beijing > Beijing (0.04)
  - Guangdong Province > Shenzhen (0.04)

Genre:
- Research Report (0.81)

Industry:
- Media > Photography (1.00)

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (0.93)
  - Artificial Intelligence
    - Vision (1.00)
    - Natural Language > Large Language Model (1.00)
    - Machine Learning > Neural Networks
      - Deep Learning (0.68)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found