Supervised Visual Attention for Simultaneous Multimodal Machine Translation