Towards Multimodal Social Conversations with Robots: Using Vision-Language Models