Deep soccer captioning with transformer: dataset, semantics-related losses, and multi-level evaluation