Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention
