Bridging Vision Language Models and Symbolic Grounding for Video Question Answering
