pronunciation
CoreaSpeech: Korean Speech Corpus via Jamo-based Coreset Selection for Efficient and Robust Korean Speech Generation
While substantial advances have been achieved in TTS for languages such as English and Mandarin, Korean remains comparatively underrepresented due to the lack of rigorous preprocessing methods, systematically constructed datasets, a shortage of standardized Korean TTS benchmarks, and explicitly optimized models for Korean. To address these limitations, we propose a Korean-tailored data-refinement and coreset selection pipeline. It refines speech data and performs textual normalization especially for numerals and English terms, followed by a novel coreset selection strategy that leverages Jamo-based linguistic and phonological features unique to Korean. As a result, we release CoreaSpeech, an efficient and robust Korean speech corpus comprising 700 hours across 21,449 speakers. This refined core subset, evenly balanced across utterances ranging from 0 to 30 seconds, is derived from 2,058 hours of widely used Korean datasets. Building on this, we conducted extensive experiments via cross-lingual fine-tuning with our CoreaSpeech dataset. Furthermore, we introduce a new universal Korean TTS benchmark dataset including clean, noisy, and numeric subsets. Additionally, we demonstrate that our Korean-specific text normalization serves as a plug-and-play module, reliably improving performance regardless of the underlying TTS architecture.
E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis
Recent advancements in speech synthesis technology have enriched our daily lives, with high-quality and human-like audio widely adopted across real-world applications. However, malicious exploitation like voice-cloning fraud poses severe security risks. Existing defense techniques struggle to address the production large language model (LLM)-based speech synthesis. While previous studies have considered the protection for fine-tuning synthesizers, they assume manually annotated transcripts. Given the labor intensity of manual annotation, end-to-end (E2E) systems leveraging automatic speech recognition (ASR) to generate transcripts are becoming increasingly prevalent, e.g., voice cloning via commercial APIs.
69c754f571806bf15add18556ff39b4f-Supplemental-Conference.pdf
Similar to the previous analysis of XLSR-53 (Choi et al., 2021), the representations from the 1st layer of XLS-R are already clustered by each speaker while it is hard to distinguish the representations of thelatterlayerbyeachspeaker. HierSpeech-UVCTK+LibriTTS (20) 3.71 15.85 6.40 4.09 30.64Untranscribed text-to-speech We describe the results of the objective evaluation for speaker adaptationinTable11. Hence, the data augmentation for speech disentanglement is not necessaryinourmethod. Note that we fail to train the model with the representations from the 23th layer of XLS-R. We train Tacotron 2 with batch size of 256 for 100k steps.
A Critical Review of the Need for Knowledge-Centric Evaluation of Quranic Recitation
Al-Kharusi, Mohammed Hilal, Hayat, Khizar, Ruqeishi, Khalil Bader Al, Lone, Haroon Rashid
The art and science of Quranic recitation (Tajweed), a discipline governed by meticulous phonetic, rhythmic, and theological principles, confronts substantial educational challenges in today's digital age. Although modern technology offers unparalleled opportunities for learning, existing automated systems for evaluating recitation have struggled to gain broad acceptance or demonstrate educational effectiveness. This literature review examines this crucial disparity, offering a thorough analysis of scholarly research, digital platforms, and commercial tools developed over the past twenty years. Our analysis uncovers a fundamental flaw in current approaches that adapt Automatic Speech Recognition (ASR) systems, which emphasize word identification over qualitative acoustic evaluation. These systems suffer from limitations such as reliance on biased datasets, demographic disparities, and an inability to deliver meaningful feedback for improvement. Challenging these data-centric methodologies, we advocate for a paradigm shift toward a knowledge-based computational framework. By leveraging the unchanging nature of the Quranic text and the well-defined rules of Tajweed, we propose that an effective evaluation system should be built upon rule-based acoustic modeling centered on canonical pronunciation principles and articulation points (Makhraj), rather than depending on statistical patterns derived from flawed or biased data. The review concludes that the future of automated Quranic recitation assessment lies in hybrid systems that combine linguistic expertise with advanced audio processing. Such an approach paves the way for developing reliable, fair, and pedagogically effective tools that can authentically assist learners across the globe.
E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis
Zhang, Zhisheng, Wang, Derui, Mi, Yifan, Wu, Zhiyong, Gao, Jie, Cao, Yuxin, Ye, Kai, Xue, Minhui, Hao, Jie
Recent advancements in speech synthesis technology have enriched our daily lives, with high-quality and human-like audio widely adopted across real-world applications. However, malicious exploitation like voice-cloning fraud poses severe security risks. Existing defense techniques struggle to address the production large language model (LLM)-based speech synthesis. While previous studies have considered the protection for fine-tuning synthesizers, they assume manually annotated transcripts. Given the labor intensity of manual annotation, end-to-end (E2E) systems leveraging automatic speech recognition (ASR) to generate transcripts are becoming increasingly prevalent, e.g., voice cloning via commercial APIs. Therefore, this E2E speech synthesis also requires new security mechanisms. To tackle these challenges, we propose E2E-VGuard, a proactive defense framework for two emerging threats: (1) production LLM-based speech synthesis, and (2) the novel attack arising from ASR-driven E2E scenarios. Specifically, we employ the encoder ensemble with a feature extractor to protect timbre, while ASR-targeted adversarial examples disrupt pronunciation. Moreover, we incorporate the psychoacoustic model to ensure perturbative imperceptibility. For a comprehensive evaluation, we test 16 open-source synthesizers and 3 commercial APIs across Chinese and English datasets, confirming E2E-VGuard's effectiveness in timbre and pronunciation protection. Real-world deployment validation is also conducted. Our code and demo page are available at https://wxzyd123.github.io/e2e-vguard/.
Data-efficient Targeted Token-level Preference Optimization for LLM-based Text-to-Speech
Kotoge, Rikuto, Sasaki, Yuichi
Aligning text-to-speech (TTS) system outputs with human feedback through preference optimization has been shown to effectively improve the robustness and naturalness of language model-based TTS models. Current approaches primarily require paired desirable and undesirable samples at the utterance level. However, such pairs are often limited in TTS output data, and utterance-level formulation prevents fine-grained token-level optimization needed for accurate pronunciation alignment. In this study, we propose TKTO that eliminates the need for paired data, enabling a more data-efficient training paradigm, and directly targets token-level units, automatically providing fine-grained alignment signals without token-level annotations. TKTO improves the challenging Japanese TTS accuracy by 39% and reduces CER by 54%, automatically assigning 12.8 times stronger reward to targeted tokens.
OLaPh: Optimal Language Phonemizer
Phonemization, the conversion of text into phonemes, is a key step in text-to-speech. Traditional approaches use rule-based transformations and lexicon lookups, while more advanced methods apply preprocessing techniques or neural networks for improved accuracy on out-of-domain vocabulary. However, all systems struggle with names, loanwords, abbreviations, and homographs. This work presents OLaPh (Optimal Language Phonemizer), a framework that combines large lexica, multiple NLP techniques, and compound resolution with a probabilistic scoring function. Evaluations in German and English show improved accuracy over previous approaches, including on a challenging dataset. To further address unresolved cases, we train a large language model on OLaPh-generated data, which achieves even stronger generalization and performance. Together, the framework and LLM improve phonemization consistency and provide a freely available resource for future research.
UtterTune: LoRA-Based Target-Language Pronunciation Edit and Control in Multilingual Text-to-Speech
ABSTRACT We propose UtterT une, a lightweight adaptation method that fine-tunes a multilingual text-to-speech (TTS) system based on a large language model (LLM) architecture, designed to enhance the controllability of pronunciation in a target language while preserving performance in others. While LLM architectures have enabled TTS models to achieve remarkable naturalness, accurately modeling grapheme-to-phoneme (G2P) mapping and prosody remains challenging, especially when the model omits an explicit G2P module and directly processes minimally encoded text (e.g., byte-pair encoding). UtterTune leverages low-rank adaptation to enable the control of segmental pronunciation and pitch accent at the phoneme level for Japanese speech, the target language in this paper, while maintaining naturalness and speaker similarity in a zero-shot setting. Objective and subjective evaluations confirm its effectiveness. Index T erms-- text-to-speech, large language model, low-rank adaptation, pronunciation, controllability 1. INTRODUCTION Text-to-speech (TTS) models based on large language model (LLM) architecture (LLM-TTS in this paper) have demonstrated exceptional naturalness, especially in zero-shot multi-speaker and multilingual synthesis, leading the way in speech synthesis technology [1, 2, 3, 4]; however, reproducing accurate pronunciation remains challenging. Some multilingual LLM-TTS, such as CosyV oice 2 [4], are designed to take raw text (characters) as input and tokenize it via byte-pair encoding (BPE) [5], without explicit phonemic or prosody markers. This design contrasts with conventional neural sequence-to-sequence TTS, which typically converts input text into phonemes (grapheme-to-phoneme; G2P) and prosody information, if needed, using a text frontend before feeding it into the model [6, 7, 8, 1]. On the other hand, such models require a large amount of speech-text pairs that cover the diversity of the target language because they predict segmental pronunciation and prosody data-driven.
'I love you too!' My family's creepy, unsettling week with an AI toy
'Let's talk about something fun!' Grem the AI chatbot toy. 'Let's talk about something fun!' Grem the AI chatbot toy. 'I love you too!' My family's creepy, unsettling week with an AI toy The cuddly chatbot Grem is designed to'learn' your child's personality, while every conversation they have is recorded, then transcribed by a third party. It wasn't long before I wanted this experiment to be over ... 'I'm going to throw that thing into a river!" my wife says as she comes down the stairs looking frazzled after putting our four-year-old daughter to bed. To be clear, "that thing" is not our daughter, Emma*. It's Grem, an AI-powered stuffed alien toy that the musician Claire Boucher, better known as Grimes, helped develop with toy company Curio. Designed for kids aged three and over and built with OpenAI's technology, the toy is supposed to "learn" your child's personality and have fun, educational conversations with them. It's advertised as a healthier alternative to screen time and is ...