Jointly Learning Visual and Auditory Speech Representations from Raw Data