Towards Explainable AI: Multi-Modal Transformer for Video-based Image Description Generation
