Cross-Modal Fine-Tuning: Align then Refine