Multi-scale Hierarchical Residual Network for Dense Captioning