Multimodal Representation Learning Conditioned on Semantic Relations