Semi-supervised multimodal coreference resolution in image narrations