Weakly-supervised Audio Separation via Bi-modal Semantic Similarity