Self-Annotated Training for Controllable Image Captioning