Learning Distinct and Representative Modes for Image Captioning