Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding
