Audio-Enhanced Vision-Language Modeling with Latent Space Broadening for High Quality Data Expansion