VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer
Montesinos, Juan F., Kadandale, Venkatesh S., Haro, Gloria
arXiv.org Artificial Intelligence
This paper presents an audio-visual approach for voice separation which produces state-of-the-art results at low latency in two scenarios: speech and singing voice. The model is based on a two-stage network. In the first stage, motion cues are obtained with a lightweight graph convolutional network that processes face landmarks; both audio and motion features are then fed to an audio-visual transformer, which produces a fairly good estimate of the isolated target source. In the second stage, the predominant voice is enhanced with an audio-only network. We present several ablation studies and a comparison with state-of-the-art methods. Finally, we explore the transferability of models trained for speech separation to the task of singing voice separation. The demos, code, and weights are available at https://ipcv.github.io/VoViT/
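To make the idea of a graph convolution over face landmarks concrete, here is a minimal sketch in plain Python. It is not the VoViT implementation: the abstract does not specify the layer type, so the mean-aggregation rule, the three-landmark chain graph, the identity weight matrix, and the names `graph_conv`, `landmarks`, and `chain` are all assumptions made for illustration.

```python
# Illustrative only: one mean-aggregation graph-convolution step over a toy
# landmark graph. This is NOT the VoViT architecture; the graph, the sizes
# and the weights are hypothetical.

def graph_conv(features, edges, weight):
    """Average each node with its neighbours, then project and apply ReLU.

    features: list of per-node feature vectors (e.g. (x, y) landmark coordinates)
    edges:    list of undirected (i, j) pairs connecting landmarks
    weight:   in_dim x out_dim matrix given as a list of rows
    """
    n = len(features)
    # Neighbour sets include a self-loop so each node keeps its own signal.
    neighbours = [{i} for i in range(n)]
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)

    in_dim = len(features[0])
    out_dim = len(weight[0])
    output = []
    for i in range(n):
        # Mean over the node and its neighbours (row-normalised adjacency).
        mean = [
            sum(features[k][d] for k in neighbours[i]) / len(neighbours[i])
            for d in range(in_dim)
        ]
        # Linear projection followed by ReLU.
        projected = [
            max(0.0, sum(mean[d] * weight[d][o] for d in range(in_dim)))
            for o in range(out_dim)
        ]
        output.append(projected)
    return output


# Toy example: three landmarks connected in a chain 0 - 1 - 2.
landmarks = [[0.0, 0.0], [1.0, 0.0], [2.0, 2.0]]
chain = [(0, 1), (1, 2)]
identity = [[1.0, 0.0], [0.0, 1.0]]
motion_features = graph_conv(landmarks, chain, identity)
```

In a real system the weight matrix would be learned and the layer would be stacked and applied across video frames. The sketch only shows how each landmark's features are mixed with those of its graph neighbours before being passed on.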
Jul-19-2022
- Country:
- Europe
- Albania > Fier County (0.04)
- Spain > Catalonia
- Barcelona Province > Barcelona (0.04)
- North America > Mexico
- Gulf of Mexico (0.04)
- Genre:
- Research Report > Promising Solution (0.34)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning > Neural Networks
- Deep Learning (0.94)
- Speech (0.93)
- Vision (0.69)