Deep soccer captioning with transformer: dataset, semantics-related losses, and multi-level evaluation

Hammoudeh, Ahmad, Vanderplaetse, Bastein, Dupont, Stéphane

Feb-11-2022–arXiv.org Artificial Intelligence

This work aims to generate captions for soccer videos using deep learning. To that end, the paper introduces a dataset, a model, and a three-level evaluation. The dataset consists of 22k caption-clip pairs and three visual features (images, optical flow, inpainting) for ~500 hours of SoccerNet videos. The model has three parts: a transformer learns language, ConvNets learn vision, and a fusion of linguistic and visual features generates captions. The paper proposes evaluating generated captions at three levels: syntax (commonly used metrics such as BLEU and CIDEr), meaning (the quality of descriptions for a domain expert), and corpus (the diversity of generated captions). The diversity of generated captions improved from 0.07 to 0.18 with semantics-related losses that prioritize selected words. Semantics-related losses and the use of additional visual features (optical flow, inpainting) improved the normalized captioning score by 28%. Project web page: https://sites.google.com/view/soccercaptioning
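
As a rough illustration of a loss that prioritizes selected words, the sketch below up-weights the cross-entropy terms of chosen target tokens. It is not the paper's formulation: the function name `weighted_cross_entropy`, the `priority_ids` set, and the `priority_weight` value are all assumptions made for this example.

```python
import math


def weighted_cross_entropy(probs, targets, priority_ids, priority_weight=2.0):
    """Weighted mean negative log-likelihood over a caption.

    probs: per-step probability distributions (each a list summing to 1).
    targets: target token id for each step.
    priority_ids: token ids to up-weight (hypothetical selection of words).
    priority_weight: weight for prioritized tokens (assumed value); others get 1.
    """
    total = 0.0
    weight_sum = 0.0
    for dist, token in zip(probs, targets):
        weight = priority_weight if token in priority_ids else 1.0
        total += -weight * math.log(dist[token])
        weight_sum += weight
    # Normalize by the total weight so the loss scale stays comparable.
    return total / weight_sum


# Toy example: a two-token vocabulary and a two-step caption.
step_probs = [[0.5, 0.5], [0.25, 0.75]]
step_targets = [0, 1]
plain = weighted_cross_entropy(step_probs, step_targets, priority_ids=set())
weighted = weighted_cross_entropy(step_probs, step_targets, priority_ids={1})
```

With an empty `priority_ids` set this reduces to the ordinary mean cross-entropy; adding a token to the set shifts the loss toward the steps where that token is the target.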

caption, dataset, significant word

Country:
- North America > United States (0.04)
- Europe > Belgium (0.04)

Genre:
- Research Report (1.00)

Industry:
- Leisure & Entertainment > Sports > Soccer (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Vision > Image Understanding (0.75)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)