Remodeling Semantic Relationships in Vision-Language Fine-Tuning