Joint Multimodal Transformer for Emotion Recognition in the Wild