Phoneme Duration


FastSpeech: Fast, Robust and Controllable Text to Speech

Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu

Neural Information Processing Systems

Prominent methods (e.g., Tacotron 2) usually first generate a mel-spectrogram from text, and then synthesize speech from the mel-spectrogram using a vocoder such as WaveNet. Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lacks controllability (voice speed or prosody control).
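
As a minimal sketch of the duration-based control idea FastSpeech describes, the snippet below expands per-phoneme encoder states into frame-level states by repeating each one for its predicted duration, with a speed factor applied to the durations. The function and variable names are illustrative, not the paper's code.

    import numpy as np

    def length_regulate(phoneme_hidden, durations, alpha=1.0):
        """Expand per-phoneme states to frame level by repeating each one.

        phoneme_hidden: (num_phonemes, hidden_dim) encoder outputs.
        durations: predicted mel-frame count per phoneme.
        alpha: speed control; durations are scaled by alpha, so alpha > 1
        slows speech down and alpha < 1 speeds it up.
        """
        # Scale and round durations; keep at least one frame per phoneme.
        scaled = np.maximum(1, np.round(np.asarray(durations) * alpha)).astype(int)
        # Repeat each phoneme state for its (scaled) number of frames.
        return np.repeat(phoneme_hidden, scaled, axis=0)

    # 3 phonemes with 4-dim states, rendered 20% slower.
    states = np.random.randn(3, 4)
    frames = length_regulate(states, durations=[5, 2, 7], alpha=1.2)
    print(frames.shape)  # (16, 4): 6 + 2 + 8 frames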


Comparative Evaluation of Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2

Rackauckas, Zackary, Hirschberg, Julia

arXiv.org Artificial Intelligence

Synthesizing expressive Japanese character speech poses unique challenges due to pitch-accent sensitivity and stylistic variability. This paper empirically evaluates two open-source text-to-speech models--VITS and Style-BERT-VITS2 JP Extra (SBV2JE)--on in-domain, character-driven Japanese speech. Using three character-specific datasets, we evaluate models across naturalness (mean opinion and comparative mean opinion score), intelligibility (word error rate), and speaker consistency. SBV2JE matches human ground truth in naturalness (MOS 4.37 vs. 4.38), achieves lower WER, and shows slight preference in CMOS. Enhanced by pitch-accent controls and a WavLM-based discriminator, SBV2JE proves effective for applications like language learning and character dialogue generation, despite higher computational demands.
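
Word error rate, used above as the intelligibility metric, is edit distance over words normalized by reference length. A self-contained sketch (rather than a library such as jiwer):

    def word_error_rate(reference, hypothesis):
        """Levenshtein distance over words, normalized by reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j]: edits to turn the first i reference words into the first j hypothesis words.
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
        return dp[len(ref)][len(hyp)] / max(1, len(ref))

    print(word_error_rate("the cat sat", "the cat sat down"))  # 1 insertion / 3 words = 0.33...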


Exploiting Context-dependent Duration Features for Voice Anonymization Attack Systems

Tomashenko, Natalia, Vincent, Emmanuel, Tommasi, Marc

arXiv.org Artificial Intelligence

The temporal dynamics of speech, encompassing variations in rhythm, intonation, and speaking rate, contain important and unique information about speaker identity. This paper proposes a new method for representing speaker characteristics by extracting context-dependent duration embeddings from speech temporal dynamics. We develop novel attack models using these representations and analyze the potential vulnerabilities in speaker verification and voice anonymization systems. The experimental results show that the developed attack models provide a significant improvement in speaker verification performance for both original and anonymized data in comparison with simpler representations of speech temporal dynamics reported in the literature.
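
The paper learns duration embeddings with neural attack models; as a loose, hand-crafted stand-in for the idea of context-dependent duration features, one could bucket each phoneme's duration by its immediate neighbours:

    from collections import defaultdict

    import numpy as np

    def duration_context_features(phones, durations):
        """Mean duration of each phoneme keyed by its triphone context.

        A crude illustration of context-dependent duration features,
        not the learned embeddings the paper describes.
        """
        buckets = defaultdict(list)
        for i in range(1, len(phones) - 1):
            ctx = (phones[i - 1], phones[i], phones[i + 1])
            buckets[ctx].append(durations[i])
        return {ctx: float(np.mean(d)) for ctx, d in buckets.items()}

    feats = duration_context_features(
        ["sil", "s", "ih", "t", "sil"], [0.10, 0.07, 0.09, 0.06, 0.12])
    print(feats)  # e.g. {('sil', 's', 'ih'): 0.07, ('s', 'ih', 't'): 0.09, ...}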


An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR

Ogun, Sewade, Colotte, Vincent, Vincent, Emmanuel

arXiv.org Artificial Intelligence

Augmenting the training data of automatic speech recognition (ASR) systems with synthetic data generated by text-to-speech (TTS) or voice conversion (VC) has gained popularity in recent years. Several works have demonstrated improvements in ASR performance using this augmentation approach. However, because of the lower diversity of synthetic speech, naively combining synthetic and real data often does not yield the best results. In this work, we leverage recently proposed flow-based TTS/VC models allowing greater speech diversity, and assess the respective impact of augmenting various speech attributes on the word error rate (WER) achieved by several ASR models. Pitch augmentation and VC-based speaker augmentation are found to be ineffective in our setup. Jointly augmenting all other attributes reduces the WER of a Conformer-Transducer model by 11% relative on Common Voice and by up to 35% relative on LibriSpeech compared to training on real data only.
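
For concreteness, a "relative" WER reduction is measured against the baseline WER, so an 11% relative gain from, say, 12.0% absolute WER lands around 10.7%. A tiny helper (names are mine, not from the paper):

    def relative_wer_reduction(baseline_wer, augmented_wer):
        """Fractional WER improvement relative to the baseline system."""
        return (baseline_wer - augmented_wer) / baseline_wer

    # 12.0% baseline WER improved to 10.68% is an 11% relative reduction.
    print(relative_wer_reduction(0.120, 0.1068))  # ~0.11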


Analysis of Speech Temporal Dynamics in the Context of Speaker Verification and Voice Anonymization

Tomashenko, Natalia, Vincent, Emmanuel, Tommasi, Marc

arXiv.org Artificial Intelligence

In this paper, we investigate the impact of speech temporal dynamics in application to automatic speaker verification and speaker voice anonymization tasks. We propose several metrics to perform automatic speaker verification based only on phoneme durations. Experimental results demonstrate that phoneme durations leak some speaker information and can reveal speaker identity from both original and anonymized speech. While specific studies have been dedicated to speaker information carried by pitch [5], [6], [8], the impact of speech temporal dynamics on speaker verification and re-identification has been overlooked.
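
One simple duration-only verification score in the spirit of the proposed metrics (though not necessarily the paper's exact formulation) compares speakers by their mean duration per phoneme:

    import numpy as np

    def duration_profile(phones, durations, inventory):
        """Mean duration per phoneme in a fixed inventory order (0.0 if unseen)."""
        prof = np.zeros(len(inventory))
        for k, p in enumerate(inventory):
            d = [dur for ph, dur in zip(phones, durations) if ph == p]
            prof[k] = np.mean(d) if d else 0.0
        return prof

    def duration_similarity(a, b):
        """Cosine similarity between two duration profiles as a trial score."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    inv = ["aa", "ih", "s", "t"]
    enroll = duration_profile(["s", "ih", "t"], [0.07, 0.09, 0.06], inv)
    test = duration_profile(["s", "aa", "t"], [0.08, 0.11, 0.05], inv)
    print(duration_similarity(enroll, test))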



Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis

Fujita, Kenichi, Ando, Atsushi, Ijima, Yusuke

arXiv.org Artificial Intelligence

This paper proposes a speech rhythm-based method for speaker embeddings to model phoneme duration using a few utterances by the target speaker. Speech rhythm is one of the essential factors among speaker characteristics, along with acoustic features such as F0, for reproducing individual utterances in speech synthesis. A novel feature of the proposed method is the rhythm-based embeddings extracted from phonemes and their durations, which are known to be related to speaking rhythm. They are extracted with a speaker identification model similar to the conventional spectral feature-based one. We conducted three experiments to evaluate performance: speaker embedding generation, speech synthesis with the generated embeddings, and embedding space analysis. The proposed method demonstrated moderate speaker identification performance (15.2% EER) even with only phonemes and their duration information. The objective and subjective evaluation results demonstrated that the proposed method can synthesize speech whose rhythm is closer to the target speaker's than that of the conventional method. We also visualized the embeddings to evaluate the relationship between embedding distance and perceptual similarity; the analysis indicated that the distribution of embeddings reflects both subjective and objective similarity.
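
The equal error rate (EER) quoted above is the operating point where the miss rate equals the false-alarm rate. A compact way to estimate it from trial scores (an illustrative sketch, not the paper's evaluation code):

    import numpy as np

    def equal_error_rate(scores, labels):
        """EER from trial scores (higher = same speaker) and 0/1 labels."""
        order = np.argsort(scores)[::-1]            # sweep thresholds from high to low
        labels = np.asarray(labels)[order]
        tp = np.cumsum(labels)                      # true accepts at each threshold
        fp = np.cumsum(1 - labels)                  # false accepts
        fnr = 1 - tp / labels.sum()                 # miss rate
        fpr = fp / (len(labels) - labels.sum())     # false-alarm rate
        idx = np.argmin(np.abs(fnr - fpr))          # point where the two rates cross
        return float((fnr[idx] + fpr[idx]) / 2)

    print(equal_error_rate([0.9, 0.8, 0.6, 0.4, 0.3, 0.1],
                           [1, 1, 0, 1, 0, 0]))     # ~0.33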


BiSinger: Bilingual Singing Voice Synthesis

Zhou, Huali, Lin, Yueqian, Shi, Yao, Sun, Peng, Li, Ming

arXiv.org Artificial Intelligence

Although Singing Voice Synthesis (SVS) has made great strides with Text-to-Speech (TTS) techniques, multilingual singing voice modeling remains relatively unexplored. This paper presents BiSinger, a bilingual pop SVS system for English and Chinese Mandarin. Current systems require separate models per language and cannot accurately represent both Chinese and English, hindering code-switch SVS. To address this gap, we design a shared representation between Chinese and English singing voices, achieved by using the CMU dictionary with mapping rules. We fuse monolingual singing datasets with open-source singing voice conversion techniques to generate bilingual singing voices while also exploring the potential use of bilingual speech data. Experiments affirm that our language-independent representation and incorporation of related datasets enable a single model with enhanced performance in English and code-switch SVS while maintaining Chinese song performance. Audio samples are available at https://bisinger-svs.github.io.
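
The shared representation rests on mapping Mandarin phonemes onto the CMU phoneme set. The toy mapping below only illustrates the mechanism; the entries are rough approximations, not BiSinger's actual mapping rules.

    # Illustrative only: a few hand-picked pinyin-to-CMU approximations.
    PINYIN_TO_CMU = {
        "b": ["B"], "p": ["P"], "m": ["M"], "f": ["F"],
        "zh": ["JH"], "sh": ["SH"], "x": ["SH"],    # rough nearest CMU consonants
        "ang": ["AA", "NG"], "in": ["IH", "N"],
    }

    def map_syllable(initial, final):
        """Map one pinyin syllable to a CMU-style phoneme sequence."""
        return PINYIN_TO_CMU.get(initial, []) + PINYIN_TO_CMU.get(final, [])

    print(map_syllable("zh", "ang"))  # ['JH', 'AA', 'NG']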


Controllable Emphasis with zero data for text-to-speech

Joly, Arnaud, Nicolis, Marco, Peterova, Ekaterina, Lombardi, Alessandro, Abbas, Ammar, van Korlaar, Arent, Hussain, Aman, Sharma, Parul, Moinet, Alexis, Lajszczak, Mateusz, Karanasou, Penny, Bonafonte, Antonio, Drugman, Thomas, Sokolova, Elena

arXiv.org Artificial Intelligence

We present a scalable method to produce high quality emphasis for text-to-speech (TTS) that does not require recordings or annotations. Many TTS models include a phoneme duration model. A simple but effective method to achieve emphasized speech consists in increasing the predicted duration of the emphasised word. We show that this is significantly better than spectrogram modification techniques, improving naturalness by 7.3% and correct testers' identification of the emphasized word in a sentence by 40% on a reference female en-US voice.
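
The core trick, increasing the predicted duration of the emphasized word, can be sketched as a post-hoc scaling of a duration model's output; everything here (names, spans, the 1.25 factor) is illustrative rather than the paper's settings.

    def emphasize(durations, word_spans, target_word, factor=1.25):
        """Lengthen the phonemes of one word to mark emphasis.

        durations: per-phoneme frame counts; word_spans: {word: (start, end)}
        phoneme index ranges; factor: how much longer the emphasized word gets.
        """
        start, end = word_spans[target_word]
        return [
            round(d * factor) if start <= i < end else d
            for i, d in enumerate(durations)
        ]

    # "the CAT sat": phonemes 0-1 "the", 2-4 "cat", 5-7 "sat"
    print(emphasize([3, 4, 5, 6, 7, 5, 6, 7], {"cat": (2, 5)}, "cat"))
    # -> [3, 4, 6, 8, 9, 5, 6, 7]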