intonation
A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models
Borodin, Kirill, Vasiliev, Nikita, Kudryavtsev, Vasiliy, Maslov, Maxim, Gorodnichev, Mikhail, Rogov, Oleg, Mkrtchian, Grach
This work is still in progress. Russian speech synthesis presents distinctive challenges, including vowel reduction, consonant devoicing, variable stress patterns, homograph ambiguity, and unnatural intonation. This paper introduces Balalaika, a novel dataset comprising more than 2,000 hours of studio-quality Russian speech with comprehensive textual annotations, including punctuation and stress markings. Experimental results show that models trained on Balalaika significantly outperform those trained on existing datasets in both speech synthesis and enhancement tasks.
- North America > United States > North Dakota > Grand Forks County > Grand Forks (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > Canada > Quebec > Montreal (0.04)
- (3 more...)
EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge
Manku, Ruskin Raj, Tang, Yuzhi, Shi, Xingjian, Li, Mu, Smola, Alex
Text-to-Speech (TTS) benchmarks often fail to capture how well models handle nuanced and semantically complex text. Building on $\textit{EmergentTTS}$, we introduce $\textit{EmergentTTS-Eval}$, a comprehensive benchmark covering six challenging TTS scenarios: emotions, paralinguistics, foreign words, syntactic complexity, complex pronunciation (e.g. URLs, formulas), and questions. Crucially, our framework automates both test-case generation and evaluation, making the benchmark easily extensible. Starting from a small set of human-written seed prompts, we iteratively extend them using LLMs to target specific structural, phonetic and prosodic challenges, resulting in 1,645 diverse test cases. Moreover, we employ a model-as-a-judge approach, using a Large Audio Language Model (LALM) to assess the speech across multiple dimensions such as expressed emotion, prosodic, intonational, and pronunciation accuracy. We evaluate state-of-the-art open-source and proprietary TTS systems, such as 11Labs, Deepgram, and OpenAI's 4o-mini-TTS, on EmergentTTS-Eval, demonstrating its ability to reveal fine-grained performance differences. Results show that the model-as-a-judge approach offers robust TTS assessment and a high correlation with human preferences. We open source the evaluation $\href{https://github.com/boson-ai/EmergentTTS-Eval-public}{code}$ and the $\href{https://huggingface.co/datasets/bosonai/EmergentTTS-Eval}{dataset}$.
- Information Technology > Artificial Intelligence > Speech (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.91)
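The model-as-a-judge loop described in the abstract above can be sketched in a few lines. The sketch below is illustrative only: `synthesize` and `judge` are hypothetical stand-ins for a TTS system and a Large Audio Language Model, and the scoring dimensions are paraphrased from the abstract; the actual benchmark code is in the linked repository.

```python
# Hedged sketch of a model-as-a-judge TTS evaluation loop.
# `tts_system.synthesize` and `lalm_judge.judge` are hypothetical APIs;
# the real code lives at https://github.com/boson-ai/EmergentTTS-Eval-public
from dataclasses import dataclass

@dataclass
class TestCase:
    category: str   # e.g. "paralinguistics", "foreign words", "questions"
    text: str       # the challenging input sentence

DIMENSIONS = ["emotion", "prosody", "intonation", "pronunciation"]

def evaluate(tts_system, lalm_judge, cases: list[TestCase]) -> dict:
    """Score a TTS system on every test case along each judged dimension."""
    scores = {d: [] for d in DIMENSIONS}
    for case in cases:
        audio = tts_system.synthesize(case.text)          # hypothetical API
        for dim in DIMENSIONS:
            prompt = (f"Rate the {dim} accuracy of this speech for the "
                      f"text: {case.text!r}. Reply with a score from 1 to 5.")
            scores[dim].append(lalm_judge.judge(audio, prompt))  # hypothetical API
    return {d: sum(v) / len(v) for d, v in scores.items()}
```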
Voice Conversion with Diverse Intonation using Conditional Variational Auto-Encoder
Suh, Soobin, Ahn, Dabi, Park, Heewoong, Park, Jonghun
Voice conversion is the task of synthesizing an utterance in a target speaker's voice while maintaining the linguistic information of the source utterance. While a speaker can produce varying utterances from a single script with different intonations, conventional voice conversion models were limited to producing only one result per source input. To overcome this limitation, we propose a novel approach for voice conversion with diverse intonations using a conditional variational autoencoder (CVAE). Experiments have shown that the speaker's style feature can be mapped into a latent space with a Gaussian distribution. We have also been able to convert voices with more diverse intonation by making the posterior of the latent space more complex with inverse autoregressive flow (IAF). As a result, the converted voice not only has a diversity of intonations, but also has better sound quality than the model without the CVAE.
- North America > United States (0.14)
- Asia > South Korea > Seoul > Seoul (0.04)
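The core mechanism here, a CVAE whose Gaussian latent captures style and whose samples yield diverse intonations, can be made concrete with a minimal PyTorch sketch. Layer sizes and the concatenation-based conditioning below are our own illustrative assumptions, not the paper's architecture, and the IAF posterior is omitted.

```python
import torch
import torch.nn as nn

class ConditionalVAE(nn.Module):
    """Minimal CVAE sketch: encode a style feature conditioned on speaker
    identity into a Gaussian latent, then decode. Dimensions are illustrative."""
    def __init__(self, feat_dim=80, cond_dim=16, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim + cond_dim, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, 128), nn.ReLU(),
            nn.Linear(128, feat_dim))

    def forward(self, x, cond):
        h = self.encoder(torch.cat([x, cond], dim=-1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z ~ N(mu, sigma^2) differentiably.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        recon = self.decoder(torch.cat([z, cond], dim=-1))
        return recon, mu, logvar

# Drawing different z for the same source input is what yields the
# diverse intonations the abstract describes.
```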
Spoken Language Intelligence of Large Language Models for Language Learning
Peng, Linkai, Nuchged, Baorian, Gao, Yingming
People have long hoped for a conversational system that can assist in real-life situations, and recent progress on large language models (LLMs) is bringing this idea closer to reality. While LLMs are often impressive in performance, their efficacy in real-world scenarios that demand expert knowledge remains unclear. LLMs are believed to hold the most potential and value in education, especially in the development of Artificial Intelligence (AI) based virtual teachers capable of facilitating language learning. Our focus is on evaluating the efficacy of LLMs in the realm of education, specifically in the area of spoken language learning, which encompasses phonetics, phonology, and second language acquisition. We introduce a new multiple-choice question dataset to evaluate the effectiveness of LLMs in the aforementioned scenarios, including the understanding and application of spoken language knowledge. In addition, we investigate the influence of various prompting techniques, such as zero- and few-shot methods (prepending the question with question-answer exemplars), chain-of-thought (CoT, think step by step), in-domain exemplars, and external tools (Google, Wikipedia). We conducted a large-scale evaluation of popular LLMs (20 distinct models) using these methods. We achieved significant performance improvements over the zero-shot baseline on practical reasoning questions (GPT-3.5, 49.1% -> 63.1%; LLaMA2-70B-Chat, 42.2% -> 48.6%). We found that models of different sizes have a good understanding of concepts in phonetics, phonology, and second language acquisition, but show limitations in reasoning about real-world problems. Additionally, we explore preliminary findings on conversational communication.
- North America > United States > Texas > Travis County > Austin (0.14)
- Asia > China > Beijing > Beijing (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States > New York > New York County > New York City (0.04)
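The prompting techniques compared in the abstract above are easy to make concrete. The templates below are our own illustration of zero-shot, few-shot, and chain-of-thought prompt construction under the descriptions given in the abstract, not the paper's exact wording.

```python
def zero_shot(question: str) -> str:
    """Zero-shot: pose the question directly, with no exemplars."""
    return f"Answer the multiple-choice question.\n{question}\nAnswer:"

def few_shot(question: str, exemplars: list[tuple[str, str]]) -> str:
    """Few-shot: prepend question-answer exemplars, as the abstract describes."""
    demos = "\n\n".join(f"{q}\nAnswer: {a}" for q, a in exemplars)
    return f"{demos}\n\n{question}\nAnswer:"

def chain_of_thought(question: str) -> str:
    """CoT: ask the model to reason step by step before answering."""
    return f"{question}\nLet's think step by step."
```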
ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis
He, Xiangheng, Chen, Junjie, Zhang, Zixing, Schuller, Björn W.
Prosody contains rich information beyond the literal meaning of words, which is crucial for the intelligibility of speech. Current models still fall short in phrasing and intonation; they not only miss or misplace breaks when synthesizing long sentences with complex structures but also produce unnatural intonation. We propose ProsodyFM, a prosody-aware text-to-speech synthesis (TTS) model with a flow-matching (FM) backbone that aims to enhance the phrasing and intonation aspects of prosody. ProsodyFM introduces two key components: a Phrase Break Encoder to capture initial phrase break locations, followed by a Duration Predictor for the flexible adjustment of break durations; and a Terminal Intonation Encoder, which learns a bank of intonation shape tokens combined with a novel Pitch Processor for more robust modeling of human-perceived intonation change. ProsodyFM is trained with no explicit prosodic labels and yet can uncover a broad spectrum of break durations and intonation patterns. Experimental results demonstrate that ProsodyFM effectively improves the phrasing and intonation aspects of prosody, thereby enhancing overall intelligibility compared to four state-of-the-art (SOTA) models. Out-of-distribution experiments show that this prosody improvement further gives ProsodyFM superior generalizability to unseen complex sentences and speakers. Our case study intuitively illustrates ProsodyFM's powerful and fine-grained controllability over phrasing and intonation.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- North America > Canada > Quebec > Montreal (0.04)
- North America > United States > Massachusetts (0.04)
- (7 more...)
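One way to read the "bank of intonation shape tokens" in the abstract above is as a learned embedding table that a pitch-derived query soft-selects over via attention. The sketch below is our conceptual rendering under that assumption only; it is not the authors' implementation, and all sizes are invented for illustration.

```python
import torch
import torch.nn as nn

class IntonationTokenBank(nn.Module):
    """Conceptual sketch: a learned bank of intonation shape tokens that a
    pitch-contour query soft-selects via scaled dot-product attention.
    Sizes are illustrative assumptions, not ProsodyFM's hyperparameters."""
    def __init__(self, n_tokens=32, dim=64):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, dim))  # shape bank
        self.query_proj = nn.Linear(dim, dim)

    def forward(self, pitch_feat):              # pitch_feat: (batch, dim)
        q = self.query_proj(pitch_feat)         # project pitch features to a query
        attn = torch.softmax(q @ self.tokens.T / q.shape[-1] ** 0.5, dim=-1)
        return attn @ self.tokens               # weighted mix of shape tokens
```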
Building a Luganda Text-to-Speech Model From Crowdsourced Data
Kagumire, Sulaiman, Katumba, Andrew, Nakatumba-Nabende, Joyce, Quinn, John
Text-to-speech (TTS) development for African languages such as Luganda is still limited, primarily due to the scarcity of the high-quality, single-speaker recordings essential for training TTS models. Prior work has focused on utilizing the Luganda Common Voice recordings of multiple speakers aged between 20 and 49. Although the generated speech is intelligible, it is still of lower quality than that of a model trained on studio-grade recordings. This is due to the insufficient data preprocessing applied to improve the quality of the Common Voice recordings. Furthermore, speech convergence is more difficult to achieve due to varying intonations as well as background noise. In this paper, we show that the quality of Luganda TTS from Common Voice can be improved by training on multiple speakers of close intonation in addition to further preprocessing of the training data. Specifically, we selected six female speakers with close intonation, determined by subjectively listening to and comparing their voice recordings. In addition to trimming silent portions from the beginning and end of the recordings, we applied a pre-trained speech enhancement model to reduce background noise and enhance audio quality. We also utilized a pre-trained, non-intrusive, self-supervised Mean Opinion Score (MOS) estimation model to keep only recordings with an estimated MOS over 3.5, indicating high perceived quality. Subjective MOS evaluations from nine native Luganda speakers demonstrate that our TTS model achieves a significantly better MOS of 3.55 compared to the reported 2.5 MOS of the existing model. Moreover, for a fair comparison, our model trained on six speakers outperforms models trained on a single speaker (3.13 MOS) or two speakers (3.22 MOS). This showcases the effectiveness of compensating for the lack of data from one speaker with data from multiple speakers of close intonation to improve TTS quality.
- Africa > Uganda > Central Region > Kampala (0.05)
- Africa > East Africa (0.04)
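The preprocessing pipeline the abstract above walks through (silence trimming, speech enhancement, MOS filtering at 3.5) can be sketched as follows. `enhance` and `estimate_mos` are hypothetical wrappers standing in for the pre-trained enhancement and non-intrusive MOS models, whose exact APIs the abstract does not name.

```python
import librosa
import soundfile as sf

MOS_THRESHOLD = 3.5  # keep only recordings with estimated MOS above this

def preprocess(paths, enhance, estimate_mos, out_dir):
    """Trim leading/trailing silence, enhance, then filter by estimated MOS.
    `enhance(y, sr)` and `estimate_mos(y, sr)` are hypothetical wrappers
    around the pre-trained models the paper uses."""
    kept = []
    for path in paths:
        y, sr = librosa.load(path, sr=None)
        y, _ = librosa.effects.trim(y, top_db=30)   # drop silent edges
        y = enhance(y, sr)                          # reduce background noise
        if estimate_mos(y, sr) > MOS_THRESHOLD:     # perceived-quality gate
            out = f"{out_dir}/{path.split('/')[-1]}"
            sf.write(out, y, sr)
            kept.append(out)
    return kept
```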
Exploring Speech Pattern Disorders in Autism using Machine Learning
Hu, Chuanbo, Thrasher, Jacob, Li, Wenqi, Ruan, Mindi, Yu, Xiangxu, Paul, Lynn K, Wang, Shuo, Li, Xin
Diagnosing autism spectrum disorder (ASD) by identifying abnormal speech patterns from examiner-patient dialogues presents significant challenges due to the subtle and diverse manifestations of speech-related symptoms in affected individuals. This study presents a comprehensive approach to identifying distinctive speech patterns through the analysis of examiner-patient dialogues. Utilizing a dataset of recorded dialogues, we extracted 40 speech-related features, categorized into frequency, zero-crossing rate, energy, spectral characteristics, Mel Frequency Cepstral Coefficients (MFCCs), and balance. These features encompass various aspects of speech such as intonation, volume, rhythm, and speech rate, reflecting the complex nature of communicative behaviors in ASD. We employed machine learning for both classification and regression tasks to analyze these speech features. The classification model aimed to differentiate between ASD and non-ASD cases, achieving an accuracy of 87.75%. Regression models were developed to predict speech-pattern-related variables and a composite score over all variables, facilitating a deeper understanding of the speech dynamics associated with ASD. The effectiveness of machine learning in interpreting intricate speech patterns and the high classification accuracy underscore the potential of computational methods to support the diagnostic process for ASD. This approach not only aids in early detection but also contributes to personalized treatment planning by providing insights into the speech and communication profiles of individuals with ASD.
- North America > United States > West Virginia > Monongalia County > Morgantown (0.04)
- North America > United States > New York > Albany County > Albany (0.04)
- North America > United States > Missouri > St. Louis County > St. Louis (0.04)
- (4 more...)
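A hedged sketch of the feature-extraction-plus-classifier recipe outlined in the abstract above, using librosa for a few of the named feature families (MFCCs, zero-crossing rate, energy, spectral characteristics) and a generic scikit-learn classifier. The paper's full 40-feature set and its actual model choice are not reproduced here.

```python
import librosa
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def speech_features(path: str) -> np.ndarray:
    """Extract a compact vector from a few of the feature families the
    study names: MFCCs, zero-crossing rate, energy, spectral centroid."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    zcr = librosa.feature.zero_crossing_rate(y).mean()
    energy = librosa.feature.rms(y=y).mean()
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()
    return np.concatenate([mfcc, [zcr, energy, centroid]])

def train_classifier(paths: list[str], labels: list[int]):
    """Fit a generic classifier to differentiate ASD vs non-ASD clips;
    RandomForest is our placeholder, not necessarily the paper's model."""
    X = np.stack([speech_features(p) for p in paths])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X, labels)
    return clf
```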
ART: The Alternating Reading Task Corpus for Speech Entrainment and Imitation
Yuan, Zheng, de Jong, Dorina, Beňuš, Štefan, Nguyen, Noël, Feng, Ruitao, Sabo, Róbert, Fadiga, Luciano, D`Ausilio, Alessandro
We introduce the Alternating Reading Task (ART) Corpus, a collection of dyadic sentence readings for studying entrainment and imitation behaviour in speech communication. The ART corpus features three experimental conditions - solo reading, alternating reading, and deliberate imitation - as well as three sub-corpora encompassing French-, Italian-, and Slovak-accented English. This design allows systematic investigation of speech entrainment in a controlled and less spontaneous setting. Alongside detailed transcriptions, it includes English proficiency scores, demographics, and in-experiment questionnaires for probing linguistic, personal, and interpersonal influences on entrainment. Our presentation covers its design, collection and annotation processes, initial analysis, and future research prospects.
- Africa > South Africa (0.04)
- Europe > Italy (0.04)
- Africa > Lesotho > Maseru > Maseru (0.04)
- (9 more...)
- Research Report > Experimental Study (0.93)
- Research Report > New Finding (0.93)
- Education (0.93)
- Leisure & Entertainment (0.68)
The AI tools that might stop you getting hired
Investigating the use of artificial intelligence (AI) in the world of work, Hilke Schellmann thought she had better try some of the tools. Among them was a one-way video interview system intended to aid recruitment, called myInterview. She got a login from the company and began to experiment – first picking the questions she, as the hiring manager, would ask, then video recording her answers as a candidate, before the proprietary software analysed the words she used and the intonation of her voice to score how well she fitted the job. She was pleased to score an 83% match for the role. But when she redid her interview not in English but in her native German, she was surprised to find that, instead of an error message, she still scored decently (73%) – and this time she hadn't even attempted to answer the questions, instead reading out a Wikipedia entry.