AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations