LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation