Supplementary Material for Multi-modal Dependency Tree for Video Captioning