I can listen but cannot read: An evaluation of two-tower multimodal systems for instrument recognition