Supplementary Material for Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective