Cross-modal Causal Relation Alignment for Video Question Grounding