VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer

Montesinos, Juan F., Kadandale, Venkatesh S., Haro, Gloria

Jul-19-2022–arXiv.org Artificial Intelligence

This paper presents an audio-visual approach to voice separation that produces state-of-the-art results at low latency in two scenarios: speech and singing voice. The model is a two-stage network. In the first stage, motion cues are obtained with a lightweight graph convolutional network that processes face landmarks; the audio and motion features are then fed to an audio-visual transformer, which produces a fairly good estimate of the isolated target source. In the second stage, the predominant voice is enhanced with an audio-only network. We present several ablation studies and a comparison with state-of-the-art methods. Finally, we explore how well models trained for speech separation transfer to the task of singing voice separation. The demos, code, and weights are available at https://ipcv.github.io/VoViT/
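
To make the idea of a graph convolution over face landmarks concrete, here is a minimal NumPy sketch of a single generic graph-convolution layer applied frame by frame. It is not the network described in the paper: the 4-landmark chain graph, the 8 output channels, the random weights, and the mean pooling over landmarks are all hypothetical choices made only for illustration.

```python
import numpy as np


def normalized_adjacency(edges, num_nodes):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A + I) D^-1/2."""
    a = np.eye(num_nodes)
    for i, j in edges:
        a[i, j] = 1.0
        a[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def graph_conv(landmarks, a_hat, weight):
    """Generic graph convolution per frame: ReLU(A_hat @ X @ W).

    landmarks: (frames, nodes, in_dim), weight: (in_dim, out_dim).
    """
    return np.maximum(a_hat @ landmarks @ weight, 0.0)


rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (2, 3)]        # hypothetical chain of 4 landmarks
a_hat = normalized_adjacency(edges, num_nodes=4)
landmarks = rng.normal(size=(5, 4, 2))  # 5 frames, 4 landmarks, (x, y) coordinates
weight = rng.normal(size=(2, 8))        # hypothetical: 8 output channels, untrained
features = graph_conv(landmarks, a_hat, weight)  # shape (5, 4, 8)
motion = features.mean(axis=1)          # pool over landmarks, shape (5, 8)
print(features.shape, motion.shape)
```

In a full system, per-frame motion features of this kind would be combined with audio features by a downstream model; how the paper does so is described only at the level of the abstract above.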
