Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms
Neural Information Processing Systems
Modern vision models are trained on very large, noisy datasets. While these models acquire strong capabilities, they may fail to follow the user's intent or to produce the desired results in certain respects, e.g., visual aesthetics, preferred style, and responsibility. In this paper, we target the realm of visual aesthetics and aim to align vision models with human aesthetic standards in a retrieval system. Advanced retrieval systems usually adopt a cascade of aesthetic models as re-rankers or filters, which are limited to low-level features like saturation and perform poorly when stylistic, cultural, or knowledge contexts are involved. We find that using the reasoning ability of large language models (LLMs) to rephrase the search query and extend the aesthetic expectations can make up for this shortcoming.
May-31-2025, 16:12:48 GMT
- Country:
- North America (0.14)
- Genre:
- Research Report > Experimental Study (0.93)
- Industry:
- Information Technology (0.92)
- Technology: