Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers
