Towards Effective Multi-Modal Interchanges in Zero-Resource Sounding Object Localization
Yang Zhao