Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking