Supervised Multimodal Bitransformers for Classifying Images and Text