pronunciation
- Europe > Czechia > South Moravian Region > Brno (0.04)
- Asia > China (0.04)
- South America > Suriname > North Atlantic Ocean (0.04)
- (5 more...)
- Information Technology > Artificial Intelligence > Speech (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
A Critical Review of the Need for Knowledge-Centric Evaluation of Quranic Recitation
Al-Kharusi, Mohammed Hilal, Hayat, Khizar, Ruqeishi, Khalil Bader Al, Lone, Haroon Rashid
The art and science of Quranic recitation (Tajweed), a discipline governed by meticulous phonetic, rhythmic, and theological principles, confronts substantial educational challenges in today's digital age. Although modern technology offers unparalleled opportunities for learning, existing automated systems for evaluating recitation have struggled to gain broad acceptance or demonstrate educational effectiveness. This literature review examines this crucial disparity, offering a thorough analysis of scholarly research, digital platforms, and commercial tools developed over the past twenty years. Our analysis uncovers a fundamental flaw in current approaches that adapt Automatic Speech Recognition (ASR) systems, which emphasize word identification over qualitative acoustic evaluation. These systems suffer from limitations such as reliance on biased datasets, demographic disparities, and an inability to deliver meaningful feedback for improvement. Challenging these data-centric methodologies, we advocate for a paradigm shift toward a knowledge-based computational framework. By leveraging the unchanging nature of the Quranic text and the well-defined rules of Tajweed, we propose that an effective evaluation system should be built upon rule-based acoustic modeling centered on canonical pronunciation principles and articulation points (Makhraj), rather than depending on statistical patterns derived from flawed or biased data. The review concludes that the future of automated Quranic recitation assessment lies in hybrid systems that combine linguistic expertise with advanced audio processing. Such an approach paves the way for developing reliable, fair, and pedagogically effective tools that can authentically assist learners across the globe.
- Asia > Middle East > Syria > Damascus Governorate > Damascus (0.04)
- Africa > Middle East > Egypt > Cairo Governorate > Cairo (0.04)
- Asia > Pakistan (0.04)
- (5 more...)
- Instructional Material (0.93)
- Overview (0.88)
- Research Report > New Finding (0.67)
- Research Report > Promising Solution (0.46)
- Education > Educational Setting > Online (0.93)
- Education > Educational Technology > Educational Software > Computer Based Training (0.68)
- Information Technology > Security & Privacy (0.67)
E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis
Zhang, Zhisheng, Wang, Derui, Mi, Yifan, Wu, Zhiyong, Gao, Jie, Cao, Yuxin, Ye, Kai, Xue, Minhui, Hao, Jie
Recent advancements in speech synthesis technology have enriched our daily lives, with high-quality and human-like audio widely adopted across real-world applications. However, malicious exploitation like voice-cloning fraud poses severe security risks. Existing defense techniques struggle to address production large language model (LLM)-based speech synthesis. While previous studies have considered the protection for fine-tuning synthesizers, they assume manually annotated transcripts. Given the labor intensity of manual annotation, end-to-end (E2E) systems leveraging automatic speech recognition (ASR) to generate transcripts are becoming increasingly prevalent, e.g., voice cloning via commercial APIs. Therefore, E2E speech synthesis also requires new security mechanisms. To tackle these challenges, we propose E2E-VGuard, a proactive defense framework for two emerging threats: (1) production LLM-based speech synthesis, and (2) the novel attack arising from ASR-driven E2E scenarios. Specifically, we employ an encoder ensemble with a feature extractor to protect timbre, while ASR-targeted adversarial examples disrupt pronunciation. Moreover, we incorporate a psychoacoustic model to ensure that the perturbation is imperceptible. For a comprehensive evaluation, we test 16 open-source synthesizers and 3 commercial APIs across Chinese and English datasets, confirming E2E-VGuard's effectiveness in timbre and pronunciation protection. Real-world deployment validation is also conducted. Our code and demo page are available at https://wxzyd123.github.io/e2e-vguard/.
- Asia > China > Guangdong Province > Shenzhen (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States (0.04)
- (4 more...)
- Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Data-efficient Targeted Token-level Preference Optimization for LLM-based Text-to-Speech
Kotoge, Rikuto, Sasaki, Yuichi
Aligning text-to-speech (TTS) system outputs with human feedback through preference optimization has been shown to effectively improve the robustness and naturalness of language model-based TTS models. Current approaches primarily require paired desirable and undesirable samples at the utterance level. However, such pairs are often limited in TTS output data, and utterance-level formulation prevents fine-grained token-level optimization needed for accurate pronunciation alignment. In this study, we propose TKTO that eliminates the need for paired data, enabling a more data-efficient training paradigm, and directly targets token-level units, automatically providing fine-grained alignment signals without token-level annotations. TKTO improves the challenging Japanese TTS accuracy by 39% and reduces CER by 54%, automatically assigning 12.8 times stronger reward to targeted tokens.
OLaPh: Optimal Language Phonemizer
Phonemization, the conversion of text into phonemes, is a key step in text-to-speech. Traditional approaches use rule-based transformations and lexicon lookups, while more advanced methods apply preprocessing techniques or neural networks for improved accuracy on out-of-domain vocabulary. However, all systems struggle with names, loanwords, abbreviations, and homographs. This work presents OLaPh (Optimal Language Phonemizer), a framework that combines large lexica, multiple NLP techniques, and compound resolution with a probabilistic scoring function. Evaluations in German and English show improved accuracy over previous approaches, including on a challenging dataset. To further address unresolved cases, we train a large language model on OLaPh-generated data, which achieves even stronger generalization and performance. Together, the framework and LLM improve phonemization consistency and provide a freely available resource for future research.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Netherlands > South Holland > Dordrecht (0.04)
- (2 more...)
UtterTune: LoRA-Based Target-Language Pronunciation Edit and Control in Multilingual Text-to-Speech
ABSTRACT
We propose UtterTune, a lightweight adaptation method that fine-tunes a multilingual text-to-speech (TTS) system based on a large language model (LLM) architecture, designed to enhance the controllability of pronunciation in a target language while preserving performance in others. While LLM architectures have enabled TTS models to achieve remarkable naturalness, accurately modeling grapheme-to-phoneme (G2P) mapping and prosody remains challenging, especially when the model omits an explicit G2P module and directly processes minimally encoded text (e.g., byte-pair encoding). UtterTune leverages low-rank adaptation to enable the control of segmental pronunciation and pitch accent at the phoneme level for Japanese speech, the target language in this paper, while maintaining naturalness and speaker similarity in a zero-shot setting. Objective and subjective evaluations confirm its effectiveness.
Index Terms: text-to-speech, large language model, low-rank adaptation, pronunciation, controllability
1. INTRODUCTION
Text-to-speech (TTS) models based on large language model (LLM) architecture (LLM-TTS in this paper) have demonstrated exceptional naturalness, especially in zero-shot multi-speaker and multilingual synthesis, leading the way in speech synthesis technology [1, 2, 3, 4]; however, reproducing accurate pronunciation remains challenging. Some multilingual LLM-TTS, such as CosyVoice 2 [4], are designed to take raw text (characters) as input and tokenize it via byte-pair encoding (BPE) [5], without explicit phonemic or prosody markers. This design contrasts with conventional neural sequence-to-sequence TTS, which typically converts input text into phonemes (grapheme-to-phoneme; G2P) and prosody information, if needed, using a text frontend before feeding it into the model [6, 7, 8, 1].
On the other hand, such models require a large amount of speech-text pairs that cover the diversity of the target language because they predict segmental pronunciation and prosody in a data-driven manner.
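The low-rank adaptation mentioned in the abstract can be illustrated with a small sketch. This shows only the generic LoRA update, where a frozen weight matrix W is adapted as W + (alpha / r) · B · A; it is not UtterTune's implementation. The matrix sizes, the `alpha` value, and all function names are assumptions made for illustration.

```python
# Generic LoRA sketch in pure Python (hypothetical names and toy sizes).
# A frozen weight W (d_out x d_in) is adapted as W + (alpha / r) * B @ A,
# where only the small matrices A (r x d_in) and B (d_out x r) are trained.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A without modifying W."""
    r = len(a)  # rank = number of rows of A
    scale = alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def count_params(matrix):
    """Number of entries in a matrix given as a list of rows."""
    return len(matrix) * len(matrix[0])

# Toy example: 4x4 frozen weight (identity) with a rank-1 adapter.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
A = [[0.5, 0.0, 0.0, 0.0]]            # r x d_in
B = [[0.0], [0.0], [0.0], [0.0]]      # d_out x r, zero-initialised

# With B at zero the adapter is a no-op, so training starts from the base model.
W_eff = lora_effective_weight(W, A, B, alpha=2.0)
```

In this toy case the adapter holds 8 trainable values against 16 in W; the saving grows with the matrix size because the adapter scales with r · (d_in + d_out) rather than d_in · d_out.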
'I love you too!' My family's creepy, unsettling week with an AI toy
'Let's talk about something fun!' Grem the AI chatbot toy. The cuddly chatbot Grem is designed to 'learn' your child's personality, while every conversation they have is recorded, then transcribed by a third party. It wasn't long before I wanted this experiment to be over ... "I'm going to throw that thing into a river!" my wife says as she comes down the stairs looking frazzled after putting our four-year-old daughter to bed. To be clear, "that thing" is not our daughter, Emma*. It's Grem, an AI-powered stuffed alien toy that the musician Claire Boucher, better known as Grimes, helped develop with toy company Curio. Designed for kids aged three and over and built with OpenAI's technology, the toy is supposed to "learn" your child's personality and have fun, educational conversations with them. It's advertised as a healthier alternative to screen time and is ...
- North America > United States (0.29)
- Europe > United Kingdom (0.15)
- Europe > Ukraine (0.05)
- (5 more...)
- Government > Regional Government (0.70)
- Leisure & Entertainment > Sports (0.69)
Graph Connectionist Temporal Classification for Phoneme Recognition
Automatic Phoneme Recognition (APR) systems are often trained using pseudo phoneme-level annotations generated from text through Grapheme-to-Phoneme (G2P) systems. These G2P systems frequently output multiple possible pronunciations per word, but the standard Connectionist Temporal Classification (CTC) loss cannot account for such ambiguity during training. In this work, we adapt Graph Temporal Classification (GTC) to the APR setting. GTC enables training from a graph of alternative phoneme sequences, allowing the model to consider multiple pronunciations per word as valid supervision. Our experiments on English and Dutch data sets show that incorporating multiple pronunciations per word into the training loss consistently improves phoneme error rates compared to a baseline trained with CTC. These results suggest that integrating pronunciation variation into the loss function is a promising strategy for training APR systems from noisy G2P-based supervision.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > Canada > Quebec > Montreal (0.05)
- North America > Canada > Ontario > Toronto (0.05)
- (15 more...)
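The idea of treating several pronunciations as valid supervision can be sketched by summing standard CTC likelihoods over an explicit list of alternative phoneme sequences. This is a simplification for illustration: it enumerates alternatives instead of building a graph, so it is not the GTC method from the paper. The blank index, the symbol inventory, the toy frame probabilities, and the function names are all assumptions, and the probability-space arithmetic is only suitable for short inputs.

```python
import math

BLANK = 0  # index of the CTC blank symbol (assumed convention)

def ctc_prob(probs, labels):
    """Probability of `labels` under CTC, summed over all alignments.

    probs[t][k] is the model's probability of symbol k at frame t.
    """
    # Extended sequence: blank, l1, blank, l2, ..., blank
    ext = [BLANK]
    for label in labels:
        ext += [label, BLANK]
    size = len(ext)
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping the blank is allowed only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)

def multi_pronunciation_loss(probs, alternatives):
    """Negative log of the total probability of any listed pronunciation."""
    return -math.log(sum(ctc_prob(probs, alt) for alt in alternatives))

# Toy example: 3 frames over symbols {0: blank, 1: "iy", 2: "ay"} (hypothetical).
frames = [[0.2, 0.4, 0.4], [0.6, 0.2, 0.2], [0.8, 0.1, 0.1]]
single = multi_pronunciation_loss(frames, [[1]])       # one reference only
both = multi_pronunciation_loss(frames, [[1], [2]])    # either variant accepted
```

Because distinct label sequences correspond to disjoint sets of alignments, accepting a second variant can only add probability mass, so `both` is never larger than `single`.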
English Pronunciation Evaluation without Complex Joint Training: LoRA Fine-tuned Speech Multimodal LLM
This study demonstrates that a Multimodal Large Language Model (MLLM) adapted via Low-Rank Adaptation (LoRA) can perform both Automatic Pronunciation Assessment (APA) and Mispronunciation Detection and Diagnosis (MDD) simultaneously. Leveraging Microsoft's Phi-4-multimodal-instruct, our fine-tuning method eliminates the need for complex architectural changes or separate training procedures conventionally required for these distinct tasks. Fine-tuned on the Speechocean762 dataset, the pronunciation evaluation scores predicted by the model exhibited a strong Pearson Correlation Coefficient (PCC > 0.7) with human-assigned scores, while achieving low Word Error Rate (WER) and Phoneme Error Rate (PER) (both < 0.15). Notably, fine-tuning only the LoRA layers was sufficient to achieve performance levels comparable to those achieved by fine-tuning all audio layers. This research highlights that an integrated pronunciation assessment system can be established by adapting large multimodal models without full fine-tuning, utilizing a significantly simpler training methodology compared to previous joint models designed for simultaneous APA and MDD. This efficient LoRA-based approach paves the way for more accessible, integrated, and effective Computer-Assisted Pronunciation Training (CAPT) technologies for English L2 learners.
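The metrics named in this abstract (PCC, WER, PER) follow standard definitions, sketched below. The example scores and token sequences are made up for illustration and are not taken from the paper or from Speechocean762; the function names are likewise hypothetical.

```python
import math

def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def error_rate(ref, hyp):
    """Edit distance divided by reference length.

    With word tokens this is WER; with phoneme tokens it is PER.
    """
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

# Made-up example data.
wer = error_rate("the cat sat".split(), "the cat sit".split())
human_scores = [2, 5, 7, 9]
model_scores = [3, 4, 8, 9]
pcc = pearson(human_scores, model_scores)
```

The same `error_rate` function covers both WER and PER; only the tokenisation of the reference and hypothesis changes.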
Automatic Pronunciation Error Detection and Correction of the Holy Quran's Learners Using Deep Learning
Abdelfattah, Abdullah, Khalil, Mahmoud I., Abbas, Hazem
Assessing spoken language is challenging, and quantifying pronunciation metrics for machine learning models is even harder. However, for the Holy Quran, this task is simplified by the rigorous recitation rules (tajweed) established by Muslim scholars, enabling highly effective assessment. Despite this advantage, the scarcity of high-quality annotated data remains a significant barrier. In this work, we bridge these gaps by introducing: (1) a 98% automated pipeline to produce high-quality Quranic datasets, encompassing collection of recitations from expert reciters, segmentation at pause points (waqf) using our fine-tuned wav2vec2-BERT model, transcription of segments, and transcript verification via our novel Tasmeea algorithm; (2) 850+ hours of audio (~300K annotated utterances); (3) a novel ASR-based approach for pronunciation error detection, utilizing our custom Quran Phonetic Script (QPS) to encode Tajweed rules (unlike the IPA standard for Modern Standard Arabic). QPS uses a two-level script: the phoneme level encodes Arabic letters with short/long vowels, and the sifa level encodes articulation characteristics of every phoneme. We further include comprehensive modeling with our novel multi-level CTC model, which achieved 0.16% average Phoneme Error Rate (PER) on the test set. We release all code, data, and models as open-source: https://obadx.github.io/prepare-quran-dataset/
- Africa > Middle East > Egypt > Cairo Governorate > Cairo (0.04)
- Asia > Middle East > Qatar > Ad-Dawhah > Doha (0.04)