Towards Understanding Sample Variance in Visually Grounded Language Generation: Evaluations and Observations