Multi-modalDependencyTreeforVideoCaptioning

Neural Information Processing Systems 

Manyexisting methods generate captions using sequence models that predict words in a left-to-right order.