Language-Guided Audio-Visual Source Separation via Trimodal Consistency