Transferring Textual Preferences to Vision-Language Understanding through Model Merging