WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning