Enhancing Cross-Modal Fine-Tuning with Gradually Intermediate Modality Generation