Clue: Cross-modal Coherence Modeling for Caption Generation