Jointly Learning Visual and Auditory Speech Representations from Raw Data

Open in new window