COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations