intonation
A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models
Borodin, Kirill, Vasiliev, Nikita, Kudryavtsev, Vasiliy, Maslov, Maxim, Gorodnichev, Mikhail, Rogov, Oleg, Mkrtchian, Grach
This work is still in progress. Russian speech synthesis presents distinctive challenges, including vowel reduction, consonant devoicing, variable stress patterns, homograph ambiguity, and unnatural intonation. This paper introduces Balalaika, a novel dataset comprising more than 2,000 hours of studio-quality Russian speech with comprehensive textual annotations, including punctuation and stress markings. Experimental results show that models trained on Balalaika significantly outperform those trained on existing datasets in both speech synthesis and enhancement tasks.
- North America > United States > North Dakota > Grand Forks County > Grand Forks (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > Canada > Quebec > Montreal (0.04)
- (3 more...)
EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge
Manku, Ruskin Raj, Tang, Yuzhi, Shi, Xingjian, Li, Mu, Smola, Alex
Text-to-Speech (TTS) benchmarks often fail to capture how well models handle nuanced and semantically complex text. Building on $\textit{EmergentTTS}$, we introduce $\textit{EmergentTTS-Eval}$, a comprehensive benchmark covering six challenging TTS scenarios: emotions, paralinguistics, foreign words, syntactic complexity, complex pronunciation (e.g. URLs, formulas), and questions. Crucially, our framework automates both test-case generation and evaluation, making the benchmark easily extensible. Starting from a small set of human-written seed prompts, we iteratively extend them using LLMs to target specific structural, phonetic and prosodic challenges, resulting in 1,645 diverse test cases. Moreover, we employ a model-as-a-judge approach, using a Large Audio Language Model (LALM) to assess the speech across multiple dimensions such as expressed emotion, prosodic, intonational, and pronunciation accuracy. We evaluate state-of-the-art open-source and proprietary TTS systems, such as 11Labs, Deepgram, and OpenAI's 4o-mini-TTS, on EmergentTTS-Eval, demonstrating its ability to reveal fine-grained performance differences. Results show that the model-as-a-judge approach offers robust TTS assessment and a high correlation with human preferences. We open source the evaluation $\href{https://github.com/boson-ai/EmergentTTS-Eval-public}{code}$ and the $\href{https://huggingface.co/datasets/bosonai/EmergentTTS-Eval}{dataset}$.
- Information Technology > Artificial Intelligence > Speech (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.91)
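The model-as-a-judge loop described in the abstract above can be sketched in a few lines. The sketch below is illustrative only: `synthesize` and `judge` are hypothetical stand-ins for a TTS system and a Large Audio Language Model, and the scoring dimensions are paraphrased from the abstract; the actual benchmark code is in the linked repository.

```python
# Hedged sketch of a model-as-a-judge TTS evaluation loop.
# `tts_system.synthesize` and `lalm_judge.judge` are hypothetical APIs;
# the real code lives at https://github.com/boson-ai/EmergentTTS-Eval-public
from dataclasses import dataclass

@dataclass
class TestCase:
    category: str   # e.g. "paralinguistics", "foreign words", "questions"
    text: str       # the challenging input sentence

DIMENSIONS = ["emotion", "prosody", "intonation", "pronunciation"]

def evaluate(tts_system, lalm_judge, cases: list[TestCase]) -> dict:
    """Score a TTS system on every test case along each judged dimension."""
    scores = {d: [] for d in DIMENSIONS}
    for case in cases:
        audio = tts_system.synthesize(case.text)          # hypothetical API
        for dim in DIMENSIONS:
            prompt = (f"Rate the {dim} accuracy of this speech for the "
                      f"text: {case.text!r}. Reply with a score from 1 to 5.")
            scores[dim].append(lalm_judge.judge(audio, prompt))  # hypothetical API
    return {d: sum(v) / len(v) for d, v in scores.items()}
```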
Voice Conversion with Diverse Intonation using Conditional Variational Auto-Encoder
Suh, Soobin, Ahn, Dabi, Park, Heewoong, Park, Jonghun
Voice conversion is the task of synthesizing an utterance in a target speaker's voice while maintaining the linguistic information of the source utterance. While a speaker can produce varying utterances from a single script with different intonations, conventional voice conversion models were limited to producing only one result per source input. To overcome this limitation, we propose a novel approach for voice conversion with diverse intonations using a conditional variational autoencoder (CVAE). Experiments have shown that the speaker's style feature can be mapped into a latent space with a Gaussian distribution. We have also been able to convert voices with more diverse intonation by making the posterior of the latent space more complex with inverse autoregressive flow (IAF). As a result, the converted voice not only has a diversity of intonations, but also has better sound quality than the model without the CVAE.
- North America > United States (0.14)
- Asia > South Korea > Seoul > Seoul (0.04)
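The core mechanism here, a CVAE whose Gaussian latent captures style and whose samples yield diverse intonations, can be made concrete with a minimal PyTorch sketch. Layer sizes and the concatenation-based conditioning below are our own illustrative assumptions, not the paper's architecture, and the IAF posterior is omitted.

```python
import torch
import torch.nn as nn

class ConditionalVAE(nn.Module):
    """Minimal CVAE sketch: encode a style feature conditioned on speaker
    identity into a Gaussian latent, then decode. Dimensions are illustrative."""
    def __init__(self, feat_dim=80, cond_dim=16, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim + cond_dim, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, 128), nn.ReLU(),
            nn.Linear(128, feat_dim))

    def forward(self, x, cond):
        h = self.encoder(torch.cat([x, cond], dim=-1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z ~ N(mu, sigma^2) differentiably.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        recon = self.decoder(torch.cat([z, cond], dim=-1))
        return recon, mu, logvar

# Drawing different z for the same source input is what yields the
# diverse intonations the abstract describes.
```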
Spoken Language Intelligence of Large Language Models for Language Learning
Peng, Linkai, Nuchged, Baorian, Gao, Yingming
People have long hoped for a conversational system that can assist in real-life situations, and recent progress on large language models (LLMs) is bringing this idea closer to reality. While LLMs are often impressive in performance, their efficacy in real-world scenarios that demand expert knowledge remains unclear. LLMs are believed to hold the most potential and value in education, especially in the development of Artificial Intelligence (AI) based virtual teachers capable of facilitating language learning. Our focus is on evaluating the efficacy of LLMs in the realm of education, specifically in the area of spoken language learning, which encompasses phonetics, phonology, and second language acquisition. We introduce a new multiple-choice question dataset to evaluate the effectiveness of LLMs in the aforementioned scenarios, including the understanding and application of spoken language knowledge. In addition, we investigate the influence of various prompting techniques, such as zero- and few-shot methods (prepending the question with question-answer exemplars), chain-of-thought (CoT, think step by step), in-domain exemplars, and external tools (Google, Wikipedia). We conducted a large-scale evaluation of popular LLMs (20 distinct models) using these methods. We achieved significant performance improvements over the zero-shot baseline on practical reasoning questions (GPT-3.5, 49.1% -> 63.1%; LLaMA2-70B-Chat, 42.2% -> 48.6%). We found that models of different sizes have a good understanding of concepts in phonetics, phonology, and second language acquisition, but show limitations in reasoning about real-world problems. Additionally, we explore preliminary findings on conversational communication.
- North America > United States > Texas > Travis County > Austin (0.14)
- Asia > China > Beijing > Beijing (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States > New York > New York County > New York City (0.04)
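The prompting techniques compared in the abstract above are easy to make concrete. The templates below are our own illustration of zero-shot, few-shot, and chain-of-thought prompt construction under the descriptions given in the abstract, not the paper's exact wording.

```python
def zero_shot(question: str) -> str:
    """Zero-shot: pose the question directly, with no exemplars."""
    return f"Answer the multiple-choice question.\n{question}\nAnswer:"

def few_shot(question: str, exemplars: list[tuple[str, str]]) -> str:
    """Few-shot: prepend question-answer exemplars, as the abstract describes."""
    demos = "\n\n".join(f"{q}\nAnswer: {a}" for q, a in exemplars)
    return f"{demos}\n\n{question}\nAnswer:"

def chain_of_thought(question: str) -> str:
    """CoT: ask the model to reason step by step before answering."""
    return f"{question}\nLet's think step by step."
```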
ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis
He, Xiangheng, Chen, Junjie, Zhang, Zixing, Schuller, Björn W.
Prosody contains rich information beyond the literal meaning of words, which is crucial for the intelligibility of speech. Current models still fall short in phrasing and intonation; they not only miss or misplace breaks when synthesizing long sentences with complex structures but also produce unnatural intonation. We propose ProsodyFM, a prosody-aware text-to-speech synthesis (TTS) model with a flow-matching (FM) backbone that aims to enhance the phrasing and intonation aspects of prosody. ProsodyFM introduces two key components: a Phrase Break Encoder to capture initial phrase break locations, followed by a Duration Predictor for the flexible adjustment of break durations; and a Terminal Intonation Encoder, which learns a bank of intonation shape tokens combined with a novel Pitch Processor for more robust modeling of human-perceived intonation change. ProsodyFM is trained with no explicit prosodic labels and yet can uncover a broad spectrum of break durations and intonation patterns. Experimental results demonstrate that ProsodyFM effectively improves the phrasing and intonation aspects of prosody, thereby enhancing overall intelligibility compared to four state-of-the-art (SOTA) models. Out-of-distribution experiments show that this prosody improvement further gives ProsodyFM superior generalizability to unseen complex sentences and speakers. Our case study intuitively illustrates ProsodyFM's powerful and fine-grained controllability over phrasing and intonation.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- North America > Canada > Quebec > Montreal (0.04)
- North America > United States > Massachusetts (0.04)
- (7 more...)
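One way to read the "bank of intonation shape tokens" in the abstract above is as a learned embedding table that a pitch-derived query soft-selects over via attention. The sketch below is our conceptual rendering under that assumption only; it is not the authors' implementation, and all sizes are invented for illustration.

```python
import torch
import torch.nn as nn

class IntonationTokenBank(nn.Module):
    """Conceptual sketch: a learned bank of intonation shape tokens that a
    pitch-contour query soft-selects via scaled dot-product attention.
    Sizes are illustrative assumptions, not ProsodyFM's hyperparameters."""
    def __init__(self, n_tokens=32, dim=64):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, dim))  # shape bank
        self.query_proj = nn.Linear(dim, dim)

    def forward(self, pitch_feat):              # pitch_feat: (batch, dim)
        q = self.query_proj(pitch_feat)         # project pitch features to a query
        attn = torch.softmax(q @ self.tokens.T / q.shape[-1] ** 0.5, dim=-1)
        return attn @ self.tokens               # weighted mix of shape tokens
```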
Building a Luganda Text-to-Speech Model From Crowdsourced Data
Kagumire, Sulaiman, Katumba, Andrew, Nakatumba-Nabende, Joyce, Quinn, John
Text-to-speech (TTS) development for African languages such as Luganda is still limited, primarily due to the scarcity of the high-quality, single-speaker recordings essential for training TTS models. Prior work has focused on utilizing the Luganda Common Voice recordings of multiple speakers aged between 20 and 49. Although the generated speech is intelligible, it is still of lower quality than that of a model trained on studio-grade recordings. This is due to the insufficient data preprocessing applied to improve the quality of the Common Voice recordings. Furthermore, speech convergence is more difficult to achieve due to varying intonations as well as background noise. In this paper, we show that the quality of Luganda TTS from Common Voice can be improved by training on multiple speakers of close intonation in addition to further preprocessing of the training data. Specifically, we selected six female speakers with close intonation, determined by subjectively listening to and comparing their voice recordings. In addition to trimming silent portions from the beginning and end of the recordings, we applied a pre-trained speech enhancement model to reduce background noise and enhance audio quality. We also utilized a pre-trained, non-intrusive, self-supervised Mean Opinion Score (MOS) estimation model to keep only recordings with an estimated MOS over 3.5, indicating high perceived quality. Subjective MOS evaluations from nine native Luganda speakers demonstrate that our TTS model achieves a significantly better MOS of 3.55 compared to the reported 2.5 MOS of the existing model. Moreover, for a fair comparison, our model trained on six speakers outperforms models trained on a single speaker (3.13 MOS) or two speakers (3.22 MOS). This showcases the effectiveness of compensating for the lack of data from one speaker with data from multiple speakers of close intonation to improve TTS quality.
- Africa > Uganda > Central Region > Kampala (0.05)
- Africa > East Africa (0.04)
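The preprocessing pipeline the abstract above walks through (silence trimming, speech enhancement, MOS filtering at 3.5) can be sketched as follows. `enhance` and `estimate_mos` are hypothetical wrappers standing in for the pre-trained enhancement and non-intrusive MOS models, whose exact APIs the abstract does not name.

```python
import librosa
import soundfile as sf

MOS_THRESHOLD = 3.5  # keep only recordings with estimated MOS above this

def preprocess(paths, enhance, estimate_mos, out_dir):
    """Trim leading/trailing silence, enhance, then filter by estimated MOS.
    `enhance(y, sr)` and `estimate_mos(y, sr)` are hypothetical wrappers
    around the pre-trained models the paper uses."""
    kept = []
    for path in paths:
        y, sr = librosa.load(path, sr=None)
        y, _ = librosa.effects.trim(y, top_db=30)   # drop silent edges
        y = enhance(y, sr)                          # reduce background noise
        if estimate_mos(y, sr) > MOS_THRESHOLD:     # perceived-quality gate
            out = f"{out_dir}/{path.split('/')[-1]}"
            sf.write(out, y, sr)
            kept.append(out)
    return kept
```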
Exploring Speech Pattern Disorders in Autism using Machine Learning
Hu, Chuanbo, Thrasher, Jacob, Li, Wenqi, Ruan, Mindi, Yu, Xiangxu, Paul, Lynn K, Wang, Shuo, Li, Xin
Diagnosing autism spectrum disorder (ASD) by identifying abnormal speech patterns from examiner-patient dialogues presents significant challenges due to the subtle and diverse manifestations of speech-related symptoms in affected individuals. This study presents a comprehensive approach to identifying distinctive speech patterns through the analysis of examiner-patient dialogues. Utilizing a dataset of recorded dialogues, we extracted 40 speech-related features, categorized into frequency, zero-crossing rate, energy, spectral characteristics, Mel Frequency Cepstral Coefficients (MFCCs), and balance. These features encompass various aspects of speech such as intonation, volume, rhythm, and speech rate, reflecting the complex nature of communicative behaviors in ASD. We employed machine learning for both classification and regression tasks to analyze these speech features. The classification model aimed to differentiate between ASD and non-ASD cases, achieving an accuracy of 87.75%. Regression models were developed to predict speech-pattern-related variables and a composite score over all variables, facilitating a deeper understanding of the speech dynamics associated with ASD. The effectiveness of machine learning in interpreting intricate speech patterns and the high classification accuracy underscore the potential of computational methods to support the diagnostic process for ASD. This approach not only aids in early detection but also contributes to personalized treatment planning by providing insights into the speech and communication profiles of individuals with ASD.
- North America > United States > West Virginia > Monongalia County > Morgantown (0.04)
- North America > United States > New York > Albany County > Albany (0.04)
- North America > United States > Missouri > St. Louis County > St. Louis (0.04)
- (4 more...)
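A hedged sketch of the feature-extraction-plus-classifier recipe outlined in the abstract above, using librosa for a few of the named feature families (MFCCs, zero-crossing rate, energy, spectral characteristics) and a generic scikit-learn classifier. The paper's full 40-feature set and its actual model choice are not reproduced here.

```python
import librosa
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def speech_features(path: str) -> np.ndarray:
    """Extract a compact vector from a few of the feature families the
    study names: MFCCs, zero-crossing rate, energy, spectral centroid."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    zcr = librosa.feature.zero_crossing_rate(y).mean()
    energy = librosa.feature.rms(y=y).mean()
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()
    return np.concatenate([mfcc, [zcr, energy, centroid]])

def train_classifier(paths: list[str], labels: list[int]):
    """Fit a generic classifier to differentiate ASD vs non-ASD clips;
    RandomForest is our placeholder, not necessarily the paper's model."""
    X = np.stack([speech_features(p) for p in paths])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X, labels)
    return clf
```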
ART: The Alternating Reading Task Corpus for Speech Entrainment and Imitation
Yuan, Zheng, de Jong, Dorina, Beňuš, Štefan, Nguyen, Noël, Feng, Ruitao, Sabo, Róbert, Fadiga, Luciano, D`Ausilio, Alessandro
We introduce the Alternating Reading Task (ART) Corpus, a collection of dyadic sentence readings for studying entrainment and imitation behaviour in speech communication. The ART corpus features three experimental conditions - solo reading, alternating reading, and deliberate imitation - as well as three sub-corpora encompassing French-, Italian-, and Slovak-accented English. This design allows systematic investigation of speech entrainment in a controlled and less spontaneous setting. Alongside detailed transcriptions, it includes English proficiency scores, demographics, and in-experiment questionnaires for probing linguistic, personal, and interpersonal influences on entrainment. Our presentation covers its design, collection and annotation processes, initial analysis, and future research prospects.
- Africa > South Africa (0.04)
- Europe > Italy (0.04)
- Africa > Lesotho > Maseru > Maseru (0.04)
- (9 more...)
- Research Report > Experimental Study (0.93)
- Research Report > New Finding (0.93)
- Education (0.93)
- Leisure & Entertainment (0.68)
The AI tools that might stop you getting hired
Investigating the use of artificial intelligence (AI) in the world of work, Hilke Schellmann thought she had better try some of the tools. Among them was a one-way video interview system intended to aid recruitment, called myInterview. She got a login from the company and began to experiment – first picking the questions she, as the hiring manager, would ask, then video recording her answers as a candidate, before the proprietary software analysed the words she used and the intonation of her voice to score how well she fitted the job. She was pleased to score an 83% match for the role. But when she redid her interview not in English but in her native German, she was surprised to find that, instead of an error message, she still scored decently (73%) – and this time she hadn't even attempted to answer the questions, instead reading out a Wikipedia entry.