Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification