Zero-shot personalized lip-to-speech synthesis with face image based voice control