With Limited Data for Multimodal Alignment, Let the STRUCTURE Guide You