Leveraging Generative Language Models for Weakly Supervised Sentence Component Analysis in Video-Language Joint Learning