What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction

Sunny Panchal¹, Guillaume Berger¹, Antoine Mercier¹

Neural Information Processing Systems 

Vision-language models have shown impressive progress in recent years. However, existing models are largely limited to turn-based interactions, where each turn must be stepped (i.e., prompted) by the user. Open-ended, asynchronous interactions, where an AI model may proactively deliver timely responses or feedback based on the unfolding situation in real time, remain an open challenge.