Saeki, Takaaki
YODAS: Youtube-Oriented Dataset for Audio and Speech
Li, Xinjian, Takamichi, Shinnosuke, Saeki, Takaaki, Chen, William, Shiota, Sayaka, Watanabe, Shinji
In this study, we introduce YODAS (YouTube-Oriented Dataset for Audio and Speech), a large-scale, multilingual dataset currently comprising over 500k hours of speech data in more than 100 languages, sourced from both labeled and unlabeled YouTube speech datasets. The labeled subsets, which include manual or automatic subtitles, facilitate supervised model training, while the unlabeled subsets are suited to self-supervised learning applications. YODAS is distinctive as the first publicly available dataset of its scale, and it is distributed under a Creative Commons license. We introduce the collection methodology used for YODAS, which contributes to large-scale speech dataset construction. We then provide a comprehensive analysis of the speech and text contained within the dataset. Finally, we describe speech recognition baselines for the top 15 languages.
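As a usage illustration only, the sketch below streams a few examples from one labeled subset, assuming the dataset is hosted on the Hugging Face Hub; the repository id, subset name, and field names are assumptions for illustration rather than details from the abstract.

```python
# Minimal sketch of streaming one labeled YODAS subset with the Hugging Face `datasets`
# library. The repository id "espnet/yodas", the subset name "en000", and the field
# names below are assumptions, not taken from the abstract; check the official release.
from itertools import islice
from datasets import load_dataset

subset = load_dataset("espnet/yodas", "en000", split="train", streaming=True)

for example in islice(subset, 3):
    # Each example is expected to pair an audio segment with its manual or automatic subtitle.
    print(example["utt_id"], example["text"])
```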
Empirical Study Incorporating Linguistic Knowledge on Filled Pauses for Personalized Spontaneous Speech Synthesis
Matsunaga, Yuta, Saeki, Takaaki, Takamichi, Shinnosuke, Saruwatari, Hiroshi
We present a comprehensive empirical study of personalized spontaneous speech synthesis on the basis of linguistic knowledge. With the advent of voice cloning for reading-style speech synthesis, a new voice cloning paradigm for human-like and spontaneous speech synthesis is required. We therefore focus on personalized spontaneous speech synthesis that can clone both an individual's voice timbre and speech disfluency. Specifically, we deal with filled pauses, a major source of speech disfluency, which are known in psychology and linguistics to play an important role in speech generation and communication. To comparatively evaluate personalized filled pause insertion and non-personalized filled pause prediction methods, we developed a speech synthesis method with a non-personalized external filled pause predictor trained on a multi-speaker corpus. The results clarify the position-word entanglement of filled pauses, i.e., positions must be predicted precisely for the naturalness of synthesized speech, while words must be predicted precisely for individuality.
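To illustrate the position/word decomposition discussed above, the toy sketch below (not the authors' method) treats filled pause insertion as two separate decisions per word boundary: whether to insert a pause, and which filler word to use. All probabilities are made-up values.

```python
# Illustrative sketch only: filled pause insertion as two decisions per word boundary.
# The filler list and probabilities are toy values, not model outputs from the paper.
import random

FILLER_WORDS = ["uh", "um", "well"]  # example fillers, assumed for illustration

def insert_filled_pauses(words, position_probs, word_probs, seed=0):
    """Insert a filler before word i with probability position_probs[i];
    the filler itself is sampled from word_probs[i] (one distribution per position)."""
    rng = random.Random(seed)
    output = []
    for i, w in enumerate(words):
        if rng.random() < position_probs[i]:
            output.append(rng.choices(FILLER_WORDS, weights=word_probs[i])[0])
        output.append(w)
    return " ".join(output)

words = ["so", "the", "model", "predicts", "pauses"]
position_probs = [0.6, 0.05, 0.05, 0.4, 0.05]       # where to insert (naturalness)
word_probs = [[0.7, 0.2, 0.1]] * len(words)          # which filler to insert (individuality)
print(insert_filled_pauses(words, position_probs, word_probs))
```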
Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining
Saeki, Takaaki, Maiti, Soumi, Li, Xinjian, Watanabe, Shinji, Takamichi, Shinnosuke, Saruwatari, Hiroshi
While neural text-to-speech (TTS) has achieved human-like natural synthetic speech, multilingual TTS systems are limited to resource-rich languages due to the need for paired text and studio-quality audio data. This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language. The use of text-only data allows the development of TTS systems for low-resource languages for which only textual resources are available, making TTS accessible to thousands of languages. Inspired by the strong cross-lingual transferability of multilingual language models, our framework first performs masked language model pretraining with multilingual text-only data. Then we train this model with paired data in a supervised manner while freezing a language-aware embedding layer. This allows inference even for languages not included in the paired data but present in the text-only data. Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12% for an unseen language.
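As a rough illustration of the frozen language-aware embedding idea described above, the PyTorch-style sketch below freezes the language embedding during supervised fine-tuning so that languages seen only in text-only pretraining keep their pretrained representation; the module structure, names, and sizes are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch: freeze a language-aware embedding layer after text-only pretraining,
# then fine-tune the remaining parameters on paired (text, speech) data.
# All module names and dimensions are assumptions for illustration.
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, num_languages=100, vocab_size=512, d_model=256):
        super().__init__()
        self.lang_embedding = nn.Embedding(num_languages, d_model)   # language-aware embedding
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
        )

    def forward(self, tokens, lang_id):
        # Add the language embedding to every token position, then encode.
        x = self.token_embedding(tokens) + self.lang_embedding(lang_id).unsqueeze(1)
        return self.encoder(x)

model = TextEncoder()
# ... masked language model pretraining on multilingual text-only data would happen here ...

# Supervised TTS fine-tuning: freeze the language-aware embedding so that languages present
# only in the text-only data remain usable at inference time.
model.lang_embedding.weight.requires_grad = False
optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)
```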
Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech
Saeki, Takaaki, Zen, Heiga, Chen, Zhehuai, Morioka, Nobuyuki, Wang, Gary, Zhang, Yu, Bapna, Ankur, Rosenberg, Andrew, Ramabhadran, Bhuvana
This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS typically supports tens of languages, which are a small fraction of the thousands of languages in the world. One difficulty to scale multilingual TTS to hundreds of languages is collecting high-quality speech-text paired data in low-resource languages. This study extends Maestro, a speech-text joint pretraining framework for automatic speech recognition (ASR), to speech generation tasks. To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (paired TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets.
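The routing of paired and unpaired data to different training schemes can be sketched generically as below; this is a hedged outline with placeholder losses, not the Maestro/Virtuoso training recipe.

```python
# Hedged sketch of the data-routing idea: each batch is handled by a scheme matching its
# supervision type. The loss functions and tensor fields are placeholders, not the
# objectives used in the paper.
import torch
import torch.nn.functional as F

def training_loss(batch):
    if batch["kind"] == "paired":          # paired TTS/ASR data: supervised loss
        return F.l1_loss(batch["predicted_speech"], batch["target_speech"])
    if batch["kind"] == "speech_only":     # untranscribed speech: reconstruction-style loss
        return F.l1_loss(batch["reconstructed_speech"], batch["target_speech"])
    # unspoken text: masked-prediction-style loss
    return F.cross_entropy(batch["text_logits"], batch["masked_targets"])

# Toy usage with random tensors, just to show the routing.
paired_batch = {"kind": "paired",
                "predicted_speech": torch.randn(2, 80), "target_speech": torch.randn(2, 80)}
print(training_loss(paired_batch))
```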
Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech
Yang, Dong, Koriyama, Tomoki, Saito, Yuki, Saeki, Takaaki, Xin, Detai, Saruwatari, Hiroshi
Pause insertion, also known as phrase break prediction and phrasing, is an essential part of TTS systems because proper pauses with natural duration significantly enhance the rhythm and intelligibility of synthetic speech. However, conventional phrasing models ignore the different styles in which speakers insert silent pauses, which can degrade the performance of a model trained on a multi-speaker speech corpus. To address this, we propose more powerful pause insertion frameworks based on a pre-trained language model. Our approach uses bidirectional encoder representations from transformers (BERT) pre-trained on a large-scale text corpus, injecting a speaker embedding to capture various speaker characteristics. We also leverage duration-aware pause insertion for more natural multi-speaker TTS. We develop and evaluate two types of models. The first improves conventional phrasing models on the position prediction of respiratory pauses (RPs), i.e., silent pauses at word transitions without punctuation. It performs speaker-conditioned RP prediction considering contextual information and is used to demonstrate the effect of speaker information on the prediction. The second model is further designed for phoneme-based TTS models and performs duration-aware pause insertion, predicting both RPs and punctuation-indicated pauses (PIPs), which are categorized by duration. The evaluation results show that our models improve the precision and recall of pause insertion and the rhythm of synthetic speech.
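As a rough sketch of speaker-conditioned pause category prediction (not the authors' model), the snippet below combines per-word contextual features, such as BERT outputs, with a speaker embedding and classifies each word boundary into an assumed set of pause categories.

```python
# Illustrative sketch: classify each word boundary into a pause category, conditioning
# contextual text features on a speaker embedding. The label set, feature source, and
# dimensions are assumptions for illustration.
import torch
import torch.nn as nn

PAUSE_CLASSES = ["no_pause", "respiratory_pause", "short_pip", "long_pip"]  # assumed labels

class SpeakerConditionedPausePredictor(nn.Module):
    def __init__(self, text_dim=768, num_speakers=10, spk_dim=64):
        super().__init__()
        self.speaker_embedding = nn.Embedding(num_speakers, spk_dim)
        self.classifier = nn.Linear(text_dim + spk_dim, len(PAUSE_CLASSES))

    def forward(self, word_features, speaker_id):
        # word_features: (batch, num_words, text_dim), e.g. BERT outputs pooled per word.
        spk = self.speaker_embedding(speaker_id)                       # (batch, spk_dim)
        spk = spk.unsqueeze(1).expand(-1, word_features.size(1), -1)   # broadcast over words
        return self.classifier(torch.cat([word_features, spk], dim=-1))

model = SpeakerConditionedPausePredictor()
logits = model(torch.randn(1, 6, 768), torch.tensor([3]))  # 6 words, speaker id 3
print(logits.argmax(dim=-1))                                # predicted pause class per boundary
```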
SpeechLMScore: Evaluating speech generation using speech language model
Maiti, Soumi, Peng, Yifan, Saeki, Takaaki, Watanabe, Shinji
While human evaluation is the most reliable metric for evaluating speech generation systems, it is generally costly and time-consuming. Previous studies on automatic speech quality assessment address the problem by predicting human evaluation scores with machine learning models. However, they rely on supervised learning and thus suffer from high annotation costs and domain-shift problems. We propose SpeechLMScore, an unsupervised metric for evaluating generated speech with a speech language model. SpeechLMScore maps a speech signal into discrete tokens and computes the average log-probability of generating that token sequence. Therefore, it does not require human annotation and is a highly scalable framework. Evaluation results demonstrate that the proposed metric shows a promising correlation with human evaluation scores on different speech generation tasks, including voice conversion, text-to-speech, and speech enhancement.
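The core computation behind the metric, the average per-token log-probability of the discrete unit sequence, can be sketched as follows; how audio is tokenized and which speech language model scores the tokens are left abstract here.

```python
# Hedged sketch of the core computation: average per-token log-probability of a discrete
# unit sequence under an autoregressive language model. Audio tokenization and the choice
# of speech LM are outside this snippet.
import math

def speech_lm_score(token_log_probs):
    """token_log_probs[t] = log p(u_t | u_<t) under the speech language model."""
    return sum(token_log_probs) / len(token_log_probs)

# Toy usage: a 4-token utterance with made-up log-probabilities.
print(speech_lm_score([math.log(0.5), math.log(0.25), math.log(0.4), math.log(0.3)]))
```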