Test-Time Reasoning Through Visual Human Preferences with VLMs and Soft Rewards

Gambashidze, Alexander; Sobolev, Konstantin; Kuznetsov, Andrey; Oseledets, Ivan

arXiv.org Artificial Intelligence 

Can Visual Language Models (VLMs) effectively capture human visual preferences? This work addresses this question by training VLMs to think about preferences at test time, employing reinforcement learning methods inspired by DeepSeek R1 and OpenAI O1. Using datasets such as ImageReward and Human Preference Score v2 (HPSv2), our models achieve accuracies of 64.9% on the ImageReward test set (trained on the official ImageReward split) and 65.4% on HPSv2 (trained on approximately 25% of its data). These results match traditional encoder-based models while providing transparent reasoning and enhanced generalization. This approach makes it possible to use not only the rich world knowledge of VLMs but also their potential to think, yielding interpretable outcomes that help decision-making processes. By demonstrating that current VLMs can reason about human visual preferences, we introduce efficient soft-reward strategies for image ranking, outperforming simplistic selection or scoring methods. This reasoning capability enables VLMs to rank arbitrary images, regardless of aspect ratio or complexity, thereby potentially amplifying the effectiveness of visual Preference Optimization. By reducing the need for extensive markup while improving reward generalization and explainability, our findings can be a strong milestone that will enhance text-to-vision models even further.
