Lyrics-to-Audio Alignment by Unsupervised Discovery of Repetitive Patterns in Vowel Acoustics
arXiv.org Artificial Intelligence
Most previous approaches to lyrics-to-audio alignment relied on a pre-developed automatic speech recognition (ASR) system, which inherently had difficulty adapting its speech model to individual singers. A significant aspect missing from previous work is the self-learnability of repetitive vowel patterns in the singing voice, where the vowel part is more consistent than the consonant part. Building on this observation, our system first learns a discriminative subspace of vowel sequences using weighted symmetric non-negative matrix factorization (WS-NMF), taking the self-similarity of a standard acoustic feature as input. We then use canonical time warping (CTW), derived from a recent computer vision technique, to find an optimal spatiotemporal transformation between the text and the acoustic sequences. Experiments on Korean and English data sets showed that applying this method after a pre-developed, unsupervised singing source separation step achieved more promising results than other state-of-the-art unsupervised approaches and an existing ASR-based system.
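As a rough illustration of the first stage described in the abstract, the sketch below builds a self-similarity matrix from frame-level features and factorizes it with a symmetric NMF. It rests on several assumptions that are not taken from the paper: the features are synthetic stand-ins for real acoustic frames, the similarity measure is clipped cosine similarity, and the factorization is plain (unweighted) symmetric NMF with damped multiplicative updates rather than the weighted WS-NMF variant the authors use. The function names `self_similarity` and `symmetric_nmf` are hypothetical, and the CTW alignment step is not shown.

```python
import numpy as np

def self_similarity(features):
    """Cosine self-similarity of a (frames x dims) feature matrix."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    # Clip negatives so the matrix is non-negative, as NMF requires.
    return np.clip(unit @ unit.T, 0.0, None)

def symmetric_nmf(sim, rank, iters=500, beta=0.5, seed=0):
    """Approximate sim by H @ H.T with H >= 0.

    Uses damped multiplicative updates; this is plain symmetric NMF,
    an assumed simplification of the weighted variant (WS-NMF).
    """
    rng = np.random.default_rng(seed)
    h = rng.random((sim.shape[0], rank))
    for _ in range(iters):
        numer = sim @ h
        denom = h @ (h.T @ h) + 1e-12  # small constant avoids division by zero
        h *= (1.0 - beta) + beta * numer / denom
    return h

# Synthetic stand-in for acoustic frames: two "vowel" prototypes that
# repeat in the order A, B, A, B, ten frames each (assumed toy data).
rng = np.random.default_rng(1)
proto_a = np.array([1.0] * 6 + [0.0] * 6)
proto_b = np.array([0.0] * 6 + [1.0] * 6)
frames = np.vstack([
    proto + 0.01 * rng.normal(size=(10, 12))
    for proto in (proto_a, proto_b, proto_a, proto_b)
])

sim = self_similarity(frames)   # (40, 40) self-similarity matrix
h = symmetric_nmf(sim, rank=2)  # (40, 2) non-negative embedding
labels = h.argmax(axis=1)       # dominant component per frame
print(labels)
```

Frames whose rows of `h` have similar profiles are candidates for repetitions of the same pattern; in the paper's pipeline an embedding of this kind would then be aligned to the text sequence.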
Jan-24-2017
- Country:
  - North America > United States
    - Pennsylvania > Allegheny County
      - Pittsburgh (0.04)
    - New York > New York County
      - New York City (0.04)
  - Europe > United Kingdom
    - England > Oxfordshire > Oxford (0.04)
  - Asia
    - Middle East > Jordan (0.04)
    - South Korea
      - Seoul > Seoul (0.05)
      - Gyeonggi-do > Suwon (0.04)
- Genre:
  - Research Report (0.82)
- Industry:
  - Media > Music (1.00)
  - Leisure & Entertainment (1.00)
- Technology:
  - Information Technology
    - Data Science > Data Mining (1.00)
    - Artificial Intelligence
      - Natural Language (1.00)
      - Machine Learning > Statistical Learning (0.93)
      - Speech > Speech Recognition (0.86)