Improving Multimodal Accuracy Through Modality Pre-training and Attention