Towards Multimodal Social Conversations with Robots: Using Vision-Language Models
Janssens, Ruben, Belpaeme, Tony
arXiv.org Artificial Intelligence
Large language models have given social robots the ability to autonomously engage in open-domain conversations. However, these robots are still missing a fundamental social skill: making use of the multiple modalities that carry social interactions. While previous work has focused either on task-oriented interactions that require referencing the environment, or on specific phenomena in social interactions such as dialogue breakdowns, we outline the overall needs of a multimodal system for social conversations with robots. We then argue that vision-language models are able to process this wide range of visual information in a sufficiently general manner for autonomous social robots. We describe how to adapt them to this setting, which technical challenges remain, and briefly discuss evaluation practices.
Aug-19-2025
- Country:
- Europe > Belgium > Flanders > East Flanders > Ghent (0.04)
- Genre:
- Research Report (0.40)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning > Neural Networks
- Deep Learning (0.47)
- Natural Language > Large Language Model (1.00)
- Robots (1.00)
- Vision (1.00)