Large Scale Multimodal Classification Using an Ensemble of Transformer Models and Co-Attention