Robust Self-Supervised Audio-Visual Speech Recognition