Stack-Captioning: Coarse-to-Fine Learning for Image Captioning