Test-Time Reasoning Through Visual Human Preferences with VLMs and Soft Rewards
Gambashidze, Alexander; Sobolev, Konstantin; Kuznetsov, Andrey; Oseledets, Ivan
arXiv.org Artificial Intelligence
Can Visual Language Models (VLMs) effectively capture human visual preferences? This work addresses this question by training VLMs to think about preferences at test time, employing reinforcement learning methods inspired by DeepSeek R1 and OpenAI O1. Using datasets such as ImageReward and Human Preference Score v2 (HPSv2), our models achieve accuracies of 64.9% on the ImageReward test set (trained on the official ImageReward split) and 65.4% on HPSv2 (trained on approximately 25% of its data). These results match traditional encoder-based models while providing transparent reasoning and enhanced generalization. This approach makes it possible to use not only the rich world knowledge of VLMs but also their capacity to think, yielding interpretable outcomes that help decision-making processes. By demonstrating that current VLMs can reason about human visual preferences, we introduce efficient soft-reward strategies for image ranking that outperform simplistic selection or scoring methods. This reasoning capability enables VLMs to rank arbitrary images, regardless of aspect ratio or complexity, thereby potentially amplifying the effectiveness of visual Preference Optimization. By reducing the need for extensive markup while improving reward generalization and explainability, our findings can be a strong milestone that will enhance text-to-vision models even further.
Mar-25-2025