Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining