Variational Structured Semantic Inference for Diverse Image Captioning