Leveraging Audio-Visual Data to Reduce the Multilingual Gap in Self-Supervised Speech Models