Multimodal Large Language Models Make Text-to-Image Generative Models Align Better
Neural Information Processing Systems
Recent studies have demonstrated the exceptional potential of leveraging human preference datasets to refine text-to-image generative models, enabling them to generate more human-preferred images. Despite these advances, current human preference datasets are either prohibitively expensive to construct or lack diversity in preference dimensions, which limits their applicability for instruction tuning of open-source text-to-image generative models and hinders further exploration. To address these challenges, we first leverage multimodal large language models to create VisionPrefer, a fine-grained preference dataset that captures multiple preference aspects (prompt-following, aesthetic, fidelity, and harmlessness). We then train a corresponding reward model, VP-Score, on VisionPrefer to guide the tuning of text-to-image generative models. The preference prediction accuracy of VP-Score is validated to be comparable to that of human annotators.
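To make the pipeline concrete, below is a minimal sketch of how multi-aspect MLLM ratings could be aggregated into preference pairs and how a reward model like VP-Score could be trained on them with a standard Bradley-Terry pairwise loss. The function names, the equal-weight aggregation, and the linear reward head are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: training a reward model on MLLM-annotated preference
# pairs with a Bradley-Terry style pairwise loss. Aspect names follow the
# abstract; everything else is an assumption for illustration.

ASPECTS = ["prompt_following", "aesthetic", "fidelity", "harmlessness"]

def aggregate_aspect_scores(aspect_scores, weights=None):
    """Combine per-aspect MLLM ratings (dict of floats) into one scalar score."""
    if weights is None:
        weights = {a: 1.0 / len(ASPECTS) for a in ASPECTS}  # equal weighting assumed
    return sum(weights[a] * aspect_scores[a] for a in ASPECTS)

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry pairwise loss: the preferred image should score higher."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with random embeddings standing in for (prompt, image) features.
reward_head = torch.nn.Linear(512, 1)      # placeholder reward head
feat_chosen = torch.randn(8, 512)          # features of MLLM-preferred images
feat_rejected = torch.randn(8, 512)        # features of MLLM-rejected images

loss = preference_loss(reward_head(feat_chosen), reward_head(feat_rejected))
loss.backward()
```

The trained reward model can then serve as the feedback signal when fine-tuning a text-to-image generator, e.g., by ranking or scoring candidate generations for preference-based optimization.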