Audio-Driven Talking Face Video Generation with Joint Uncertainty Learning