Mounting Video Metadata on Transformer-based Language Model for Open-ended Video Question Answering
