Audio-visual video-to-speech synthesis with synthesized input audio