Data-Centric Improvements for Enhancing Multi-Modal Understanding in Spoken Conversation Modeling