Transferring Textual Preferences to Vision-Language Understanding through Model Merging
Chen-An Li, Tzu-Han Lin, Yun-Nung Chen, Hung-yi Lee
Large vision-language models (LVLMs) perform strongly across a wide range of multimodal tasks. However, their ability to evaluate generated content remains limited, and training vision-language reward models (VLRMs) with preference data is computationally expensive. This paper explores a training-free alternative: merging text-based reward models (RMs) with LVLMs to create VLRMs. Our results show that the merged models outperform both the LVLMs' own scoring and the text-based RMs, offering an efficient way to incorporate textual preferences into LVLMs.
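The abstract describes a training-free, parameter-space merge of a text-based RM with an LVLM. The sketch below illustrates one common form of such merging, linear interpolation of the parameters shared by the RM and the LVLM's language backbone, while copying model-specific parameters (e.g., the vision tower or the reward head) through unchanged. The interpolation weight `alpha`, the parameter names, and the helper function are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of training-free parameter-space merging, assuming the
# text-based RM and the LVLM's language backbone share an architecture and
# parameter naming scheme. `alpha` and all parameter names are hypothetical.
import torch


def merge_state_dicts(lvlm_sd: dict, text_rm_sd: dict, alpha: float = 0.5) -> dict:
    """Linearly interpolate parameters present in both state dicts.

    Parameters unique to one model (e.g., the LVLM's vision tower or the
    RM's scalar reward head) are kept as-is so the merged model can still
    encode images and emit reward scores.
    """
    merged = {}
    for name, lvlm_param in lvlm_sd.items():
        rm_param = text_rm_sd.get(name)
        if rm_param is not None and rm_param.shape == lvlm_param.shape:
            # Shared language-model weight: interpolate toward the text RM.
            merged[name] = (1.0 - alpha) * lvlm_param + alpha * rm_param
        else:
            # LVLM-only parameter (e.g., vision tower): copy unchanged.
            merged[name] = lvlm_param
    for name, rm_param in text_rm_sd.items():
        if name not in merged:
            # RM-only parameter (e.g., reward head): carry over for scoring.
            merged[name] = rm_param
    return merged


if __name__ == "__main__":
    # Toy example with made-up parameter names and shapes.
    lvlm = {"lm.layer.weight": torch.ones(2, 2), "vision_tower.weight": torch.ones(2, 2)}
    rm = {"lm.layer.weight": torch.zeros(2, 2), "reward_head.weight": torch.ones(1, 2)}
    merged = merge_state_dicts(lvlm, rm, alpha=0.5)
    print(merged["lm.layer.weight"])  # expect a 2x2 tensor of 0.5s
```

In practice, the merge could also use task-vector arithmetic or per-layer weights rather than a single global `alpha`; this sketch only shows the simplest interpolation under the shared-backbone assumption.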
arXiv.org Artificial Intelligence
Feb-19-2025