Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering