Multi-modal reward for visual relationships-based image captioning