Natural Language Generation from Visual Sequences: Challenges and Future Directions