Towards Multilingual Audio-Visual Question Answering