Vision Transformer Based Model for Describing a Set of Images as a Story