AITopics | Choi, Byoung Jin


Choi, Byoung Jin


MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance

Kim, Semin, Jeong, Myeonghun, Lee, Hyeonseung, Kim, Minchan, Choi, Byoung Jin, Kim, Nam Soo

arXiv.org Artificial Intelligence, Jun-9-2024

In this paper, we propose MakeSinger, a semi-supervised training method for singing voice synthesis (SVS) via classifier-free diffusion guidance. The challenge in SVS lies in the costly process of gathering aligned sets of text, pitch, and audio data. MakeSinger enables the training of a diffusion-based SVS model from any speech and singing voice data regardless of its labeling, thereby enhancing the quality of generated voices with a large amount of unlabeled data. At inference, our novel dual guiding mechanism gives text and pitch guidance at each reverse diffusion step by estimating the score of the masked input. Experimental results show that the model trained in a semi-supervised manner outperforms other baselines trained only on the labeled data in terms of pronunciation, pitch accuracy, and overall quality. Furthermore, we demonstrate that by adding Text-to-Speech (TTS) data in training, the model can synthesize the singing voices of TTS speakers even without their singing voices.
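To illustrate the dual guiding idea described in this abstract, here is a minimal sketch of how two classifier-free guidance terms might be combined at one reverse diffusion step. The additive form, the function name `dual_guided_score`, and the weights `w_text` and `w_pitch` are assumptions made for illustration; the abstract does not state the exact formulation used in MakeSinger.

```python
from typing import List


def dual_guided_score(
    s_uncond: List[float],
    s_text: List[float],
    s_pitch: List[float],
    w_text: float,
    w_pitch: float,
) -> List[float]:
    """Combine an unconditional score with text- and pitch-conditioned scores.

    Assumed form (not taken from the paper):
        s = s_uncond + w_text * (s_text - s_uncond) + w_pitch * (s_pitch - s_uncond)
    """
    if not (len(s_uncond) == len(s_text) == len(s_pitch)):
        raise ValueError("all score vectors must have the same length")
    return [
        u + w_text * (t - u) + w_pitch * (p - u)
        for u, t, p in zip(s_uncond, s_text, s_pitch)
    ]


# Toy values standing in for network outputs at one diffusion step.
guided = dual_guided_score(
    [0.0, 1.0], [1.0, 1.0], [0.0, 3.0], w_text=2.0, w_pitch=0.5
)
print(guided)  # [2.0, 2.0]
```

With both weights set to zero, the result falls back to the unconditional score, which is the usual sanity check for a guidance rule of this shape.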

artificial intelligence, machine learning, makesinger, (16 more...)


2406.05965

Country:

Asia > South Korea (0.14)
Europe > Germany (0.14)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)


Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction

Kim, Minchan, Jeong, Myeonghun, Choi, Byoung Jin, Kim, Semin, Lee, Joun Yeop, Kim, Nam Soo

arXiv.org Artificial Intelligence, Jan-2-2024

We propose a novel text-to-speech (TTS) framework centered around a neural transducer. Our approach divides the whole TTS pipeline into semantic-level sequence-to-sequence (seq2seq) modeling and fine-grained acoustic modeling stages, utilizing discrete semantic tokens obtained from wav2vec2.0 embeddings. For robust and efficient alignment modeling, we employ a neural transducer, named the token transducer, for semantic token prediction, benefiting from its hard monotonic alignment constraints. Subsequently, a non-autoregressive (NAR) speech generator efficiently synthesizes waveforms from these semantic tokens. Additionally, a reference speech controls temporal dynamics and acoustic conditions at each stage. This decoupled framework reduces the training complexity of TTS while allowing each stage to focus on semantic and acoustic modeling. Our experimental results on zero-shot adaptive TTS demonstrate that our model surpasses the baseline in terms of speech quality and speaker similarity, both objectively and subjectively. We also delve into the inference speed and prosody control capabilities of our approach, highlighting the potential of neural transducers in TTS frameworks.
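The hard monotonic alignment mentioned in this abstract can be pictured with a transducer-style output path: each step either emits a token for the current input position or emits a blank and advances to the next position. The sketch below is a generic illustration of that convention, not the paper's token transducer; the names `BLANK` and `alignment_from_path` and the toy symbols are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

BLANK: Optional[str] = None  # hypothetical marker for "advance to next input"


def alignment_from_path(
    path: Sequence[Optional[str]], num_inputs: int
) -> List[Tuple[int, str]]:
    """Map a transducer-style path to (input_index, token) pairs.

    A blank advances the input position; any other symbol is emitted at
    the current position, so input indices can never decrease.
    """
    position = 0
    aligned: List[Tuple[int, str]] = []
    for symbol in path:
        if symbol is BLANK:
            position += 1
            if position > num_inputs:
                raise ValueError("path contains more blanks than inputs")
        else:
            if position >= num_inputs:
                raise ValueError("token emitted after the last input")
            aligned.append((position, symbol))
    return aligned


# Two input (text) positions, three emitted semantic tokens.
pairs = alignment_from_path(["a", "b", BLANK, "c", BLANK], num_inputs=2)
print(pairs)  # [(0, 'a'), (0, 'b'), (1, 'c')]
```

Because the position counter only moves forward, every path decoded this way yields a monotonic alignment by construction.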

artificial intelligence, machine learning, natural language, (21 more...)


2401.01498

Country: North America > United States (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.94)
Information Technology > Artificial Intelligence > Speech > Acoustic Processing (0.87)


Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic Token Prediction

Kim, Minchan, Jeong, Myeonghun, Choi, Byoung Jin, Lee, Dongjune, Kim, Nam Soo

arXiv.org Artificial Intelligence, Nov-8-2023

We introduce a text-to-speech (TTS) framework based on a neural transducer. We use discretized semantic tokens acquired from wav2vec2.0 embeddings, which make it easy to adopt a neural transducer for the TTS framework and to benefit from its monotonic alignment constraints. The proposed model first generates aligned semantic tokens using the neural transducer, then synthesizes a speech sample from the semantic tokens using a non-autoregressive (NAR) speech generator. This decoupled framework alleviates the training complexity of TTS and allows each stage to focus on 1) linguistic and alignment modeling and 2) fine-grained acoustic modeling, respectively. Experimental results on zero-shot adaptive TTS show that the proposed model exceeds the baselines in speech quality and speaker similarity on both objective and subjective measures. We also investigate the inference speed and prosody controllability of our proposed model, showing the potential of the neural transducer for TTS frameworks.
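One common way to turn continuous frame embeddings into discrete tokens is nearest-centroid quantization against a learned codebook. The sketch below shows that step on toy two-dimensional vectors; the use of a centroid codebook, the function name `quantize`, and the values are assumptions for illustration, since the abstract does not describe how the wav2vec2.0 embeddings are discretized.

```python
from typing import List, Sequence


def quantize(
    frames: Sequence[Sequence[float]], codebook: Sequence[Sequence[float]]
) -> List[int]:
    """Assign each frame embedding the index of its nearest codebook entry."""

    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        # Squared Euclidean distance between two vectors of equal length.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


# Toy codebook with two centroids and three frame embeddings.
codebook = [[0.0, 0.0], [1.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.4, 0.4]]
tokens = quantize(frames, codebook)
print(tokens)  # [0, 1, 0]
```

The resulting integer sequence plays the role of the semantic tokens that the first stage predicts and the second stage consumes.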

artificial intelligence, semantic token prediction, transduce and speak, (2 more...)


2311.02898

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.60)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.60)
Information Technology > Artificial Intelligence > Assistive Technologies (0.60)


Diff-TTS: A Denoising Diffusion Model for Text-to-Speech

Jeong, Myeonghun, Kim, Hyeongju, Cheon, Sung Jun, Choi, Byoung Jin, Kim, Nam Soo

arXiv.org Artificial Intelligence, Apr-3-2021

Although neural text-to-speech (TTS) models have attracted a lot of attention and succeeded in generating human-like speech, there is still room for improvement in their naturalness and architectural efficiency. In this work, we propose a novel non-autoregressive TTS model, namely Diff-TTS, which achieves highly natural and efficient speech synthesis. Given the text, Diff-TTS exploits a denoising diffusion framework to transform the noise signal into a mel-spectrogram via diffusion time steps. In order to learn the mel-spectrogram distribution conditioned on the text, we present a likelihood-based optimization method for TTS. Furthermore, to boost the inference speed, we leverage an accelerated sampling method that allows Diff-TTS to generate raw waveforms much faster without significantly degrading perceptual quality. Through experiments, we verified that Diff-TTS generates speech 28 times faster than real time on a single NVIDIA 2080Ti GPU.
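Two ideas in this abstract lend themselves to a small worked example: accelerated sampling by visiting only a subset of diffusion time steps, and the meaning of "28 times faster than real time". The sketch below is illustrative only. The evenly strided schedule and the names `strided_timesteps` and `realtime_speedup` are assumptions, not the schedule or code used by Diff-TTS, and a real sampler would also need to adjust its noise schedule for the skipped steps.

```python
from typing import List


def strided_timesteps(total_steps: int, stride: int) -> List[int]:
    """Return the time steps visited when sampling every `stride`-th step,
    counting down from `total_steps` toward 1."""
    if total_steps < 1 or stride < 1:
        raise ValueError("total_steps and stride must be positive")
    return list(range(total_steps, 0, -stride))


def realtime_speedup(audio_seconds: float, synthesis_seconds: float) -> float:
    """Seconds of audio produced per second of compute."""
    if synthesis_seconds <= 0:
        raise ValueError("synthesis_seconds must be positive")
    return audio_seconds / synthesis_seconds


# Visiting every third step of a 10-step process takes 4 network calls.
print(strided_timesteps(10, 3))  # [10, 7, 4, 1]

# A speedup of 28 means 28 s of audio is synthesized in 1 s of compute.
print(realtime_speedup(28.0, 1.0))  # 28.0
```

Fewer visited steps mean fewer network evaluations per utterance, which is where the inference speedup comes from; the trade-off against perceptual quality is what the paper evaluates.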

deep learning, diff-tts, speech synthesis, (18 more...)


2104.01409

Country: Asia > South Korea (0.15)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.96)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.89)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.63)
