Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms

May-31-2025, 16:12:48 GMT–Neural Information Processing Systems

Modern vision models are trained on very large, noisy datasets. While these models acquire strong capabilities, they may not follow the user's intent and output the desired results in certain respects, e.g., visual aesthetics, preferred style, and responsibility. In this paper, we target the realm of visual aesthetics and aim to align vision models with human aesthetic standards in a retrieval system. Advanced retrieval systems usually adopt a cascade of aesthetic models as re-rankers or filters, but these models are limited to low-level features such as saturation and perform poorly when stylistic, cultural, or knowledge contexts are involved. We find that using the reasoning ability of large language models (LLMs) to rephrase the search query and extend the aesthetic expectations can make up for this shortcoming.
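The query-rephrasing idea can be illustrated with a minimal sketch. It is not the paper's implementation: `expand_query` is a hypothetical stand-in for an LLM call, `embed` is a toy bag-of-words encoder standing in for a real vision-language model, and the captions in `CAPTIONS` are invented for illustration.

```python
from collections import Counter
from math import sqrt
from typing import Callable, Optional


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (assumption: stands in for a real encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def expand_query(query: str) -> str:
    """Hypothetical stand-in for an LLM that spells out aesthetic expectations."""
    return f"{query} balanced composition soft natural light"


def retrieve(
    query: str,
    captions: dict[str, str],
    rephrase: Optional[Callable[[str], str]] = None,
    k: int = 2,
) -> list[str]:
    """Rank images by similarity to the (optionally rephrased) query."""
    q = embed(rephrase(query) if rephrase else query)
    ranked = sorted(captions, key=lambda name: cosine(q, embed(captions[name])), reverse=True)
    return ranked[:k]


# Invented image descriptions, used only to make the sketch runnable.
CAPTIONS = {
    "a.jpg": "cat photo overexposed blurry",
    "b.jpg": "cat photo balanced composition soft natural light",
    "c.jpg": "dog photo balanced composition",
}

plain_top = retrieve("cat photo", CAPTIONS, k=1)
expanded_top = retrieve("cat photo", CAPTIONS, rephrase=expand_query, k=2)
```

In this toy setup the plain query ranks the short, literal match first regardless of quality, while the expanded query favors the item whose description also matches the stated aesthetic expectations. A real system would replace both stand-ins with an actual LLM and image encoder.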

large language model, machine learning, natural language

Country:
- North America (0.14)

Genre:
- Research Report > Experimental Study (0.93)

Industry:
- Information Technology (0.92)

Technology:
- Information Technology
  - Artificial Intelligence
    - Machine Learning > Neural Networks
      - Deep Learning (0.46)
    - Natural Language > Large Language Model (1.00)
    - Vision (1.00)
  - Information Management > Search (1.00)