Audio-visual training for improved grounding in video-text LLMs
