UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching
Neta Glazer, Aviv Navon, Yael Segal, Aviv Shamsian, Hilit Segev, Asaf Buchnick, Menachem Pirchi, Gil Hetz, Joseph Keshet
arXiv.org Artificial Intelligence
Recent advances in Text-to-Speech (TTS) have enabled highly natural speech synthesis, yet integrating speech with complex background environments remains challenging. We introduce UmbraTTS, a flow-matching-based TTS model that jointly generates speech and environmental audio, conditioned on text and acoustic context. Our model allows fine-grained control over background volume and produces diverse, coherent, and context-aware audio scenes. A key challenge is the lack of data in which speech and background audio are aligned in a natural context. To overcome this, we propose a self-supervised framework that extracts speech, background audio, and transcripts from unannotated recordings. Extensive evaluations demonstrate that UmbraTTS significantly outperforms existing baselines, producing natural, high-quality, environmentally aware audio.
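For readers unfamiliar with flow matching, the following is a minimal sketch of the generic training objective, not the authors' implementation. It assumes the common straight-line probability path between a noise sample and a data sample; the function names, the mel-spectrogram-like tensor shape, and the omission of text and acoustic conditioning are all illustrative assumptions that do not come from the abstract.

```python
import numpy as np

def cfm_training_pair(x0, x1, t):
    """Build one flow matching training pair on a straight-line path.

    x0: noise samples, shape (batch, ...)
    x1: data samples (e.g. audio features), same shape as x0
    t:  one time value in [0, 1] per batch element, shape (batch,)
    Returns the interpolated point x_t and the velocity the model should predict.
    """
    # Reshape t to (batch, 1, 1, ...) so it broadcasts over feature dimensions.
    t = np.asarray(t, dtype=x1.dtype).reshape(-1, *([1] * (x1.ndim - 1)))
    x_t = (1.0 - t) * x0 + t * x1      # point on the straight path at time t
    target_velocity = x1 - x0          # constant velocity along that path
    return x_t, target_velocity

def cfm_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))

# Hypothetical usage: a batch of 4 feature maps with 80 bins and 100 frames.
rng = np.random.default_rng(0)
x1 = rng.standard_normal((4, 80, 100))   # stand-in for real audio features
x0 = rng.standard_normal(x1.shape)       # Gaussian noise
t = rng.uniform(0.0, 1.0, size=4)
x_t, v_target = cfm_training_pair(x0, x1, t)
# A real model would predict a velocity from (x_t, t, conditioning);
# here the noise itself serves as a dummy prediction to exercise the loss.
loss = cfm_loss(x0, v_target)
```

In a conditional setting such as the one the abstract describes, the network predicting the velocity would additionally receive the text and acoustic context as inputs; how UmbraTTS encodes them is not specified here.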
Jul-14-2025