See the Speaker: Crafting High-Resolution Talking Faces from Speech with Prior Guidance and Region Refinement

Jinting Wang, Jun Wang, Hei Victor Cheng, Li Liu

arXiv.org Artificial Intelligence 

Abstract--Unlike existing methods that rely on source images as appearance references and use source speech to generate motion, this work proposes a novel approach that extracts all information directly from the speech, addressing key challenges in speech-to-talking-face generation. Specifically, we first employ a speech-to-face portrait generation stage, using a speech-conditioned diffusion model combined with a statistical facial prior and a sample-adaptive weighting module to achieve high-quality portrait generation. To produce high-resolution outputs, we integrate a pre-trained Transformer-based discrete codebook with an image rendering network, enhancing video frame details in an end-to-end manner. Experimental results demonstrate that our method outperforms existing approaches on the HDTF, VoxCeleb, and AVSpeech datasets. Notably, this is the first method capable of generating high-resolution, high-quality talking face videos from a single speech input alone.

Audio-driven talking face generation aims to animate a target portrait image into a realistic talking video given a driving speech. This technique has wide application in practical scenarios, including high-quality film and animation production, virtual assistants, interactive educational content creation, and realistic character animation. Recently, significant advances have been made in this field with the development of generative models. Existing talking face generation methods mainly focus on animating a reference portrait [1]-[5]. However, a dilemma remains: users are concerned about privacy breaches when using real portrait images [6]. FaceChain [6] made the first attempt to dispense with the source face and directly infer the synchronized portrait from identity features disentangled from speech. However, its generated virtual faces fail to preserve identity consistency.
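The two-stage pipeline summarized above (speech-conditioned portrait generation blended with a statistical facial prior via a sample-adaptive weight, followed by a high-resolution rendering stage) can be sketched at a very high level as follows. This is a minimal illustrative sketch only: the function names, the sigmoid-based toy weighting, and the nearest-neighbor "rendering" stand-in are assumptions for exposition, not the paper's actual diffusion model, weighting module, or codebook network.

```python
import numpy as np

def speech_to_portrait(speech_feat, face_prior, rng):
    """Stage 1 (sketch): a speech-conditioned generator whose output is
    blended with a statistical facial prior via a sample-adaptive weight."""
    # Placeholder for the diffusion model's raw sample (random here).
    raw = rng.standard_normal(face_prior.shape)
    # Toy sample-adaptive weight: a sigmoid "confidence" derived from the
    # speech embedding (the real module is learned, not hand-crafted).
    w = 1.0 / (1.0 + np.exp(-speech_feat.mean()))
    return w * raw + (1.0 - w) * face_prior

def render_high_res(portrait, upscale=4):
    """Stage 2 (sketch): stand-in for the codebook-based rendering network,
    modeled here as simple nearest-neighbor upsampling."""
    return np.kron(portrait, np.ones((upscale, upscale)))

rng = np.random.default_rng(0)
speech_feat = rng.standard_normal(128)   # toy speech embedding
face_prior = np.zeros((64, 64))          # toy statistical prior (mean face)
portrait = speech_to_portrait(speech_feat, face_prior, rng)
frame = render_high_res(portrait)
print(portrait.shape, frame.shape)       # (64, 64) (256, 256)
```

The point of the blend in stage 1 is that the prior anchors the output toward plausible face statistics when the speech-conditioned sample is unreliable, with the weight varying per sample rather than being fixed.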