Unifying Vision-Language Latents for Zero-label Image Caption Enhancement