SeeingSounds: Learning Audio-to-Visual Alignment via Text