Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering
