Appendix of Prophet Attention

Neural Information Processing Systems 

CIDEr-c40, which is the default ranking score on the leaderboard, and ranks 1st. Compared with image captioning, the target of video captioning is a video clip, i.e., an ordered sequence of frames. The dataset contains 10,000 video clips, and each video is paired with 20 annotated sentences. We use the official splits to report our results. CIDEr, which is built upon n-gram matching, is used in our tests for performance evaluation. All re-implementations and our experiments were run on V100 GPUs.
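
To illustrate what "built upon n-gram matching" means in practice, the following is a minimal sketch of an n-gram cosine similarity between a candidate caption and a set of reference captions. It is a simplification and not the official CIDEr implementation: it omits the TF-IDF weighting that CIDEr applies to n-grams, and the function names (`ngrams`, `cosine`, `ngram_similarity`) and the whitespace tokenization are assumptions made for illustration only.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def ngram_similarity(candidate, references, max_n=4):
    """Average n-gram cosine similarity over n = 1..max_n and over references.

    Simplified illustration: raw counts are used instead of TF-IDF weights,
    and tokenization is a plain lowercase whitespace split.
    """
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    total = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        total += sum(cosine(cand_counts, ngrams(r, n)) for r in refs) / len(refs)
    return total / max_n


if __name__ == "__main__":
    refs = ["a man is playing a guitar", "a person plays the guitar"]
    print(ngram_similarity("a man is playing a guitar", refs))
    print(ngram_similarity("two dogs run outside", refs))
```

A candidate that shares more and longer n-grams with the references receives a higher score; a candidate with no shared n-grams receives 0.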