AITopics

Country:

Europe (0.45)
North America > United States (0.28)
Asia (0.28)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(3 more...)

Neural Information Processing Systems, Jun-15-2026, 19:17:05 GMT

E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis

e2e-vguard, large language model, machine learning, (21 more...)

Country: Asia > China (0.46)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing Systems, Feb-9-2026, 14:16:37 GMT

69c754f571806bf15add18556ff39b4f-Supplemental-Conference.pdf

Similar to the previous analysis of XLSR-53 (Choi et al., 2021), the representations from the 1st layer of XLS-R are already clustered by each speaker while it is hard to distinguish the representations of the latter layer by each speaker. HierSpeech-U VCTK+LibriTTS (20) 3.71 15.85 6.40 4.09 30.64. Untranscribed text-to-speech: We describe the results of the objective evaluation for speaker adaptation in Table 11. Hence, the data augmentation for speech disentanglement is not necessary in our method. Note that we fail to train the model with the representations from the 23rd layer of XLS-R. We train Tacotron 2 with a batch size of 256 for 100k steps.

artificial intelligence, representation, speech recognition, (14 more...)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.47)

Neural Information Processing Systems, Feb-8-2026, 20:42:48 GMT

Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech

Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable Text-to-speech (TTS) systems.

artificial intelligence, machine learning, natural language, (18 more...)

Country:

Europe > Czechia > South Moravian Region > Brno (0.04)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
North America > United States > California > Santa Clara County > Sunnyvale (0.04)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.37)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.36)

Al-Kharusi, Mohammed Hilal, Hayat, Khizar, Ruqeishi, Khalil Bader Al, Lone, Haroon Rashid

A Critical Review of the Need for Knowledge-Centric Evaluation of Quranic Recitation

arXiv.org Artificial Intelligence, Nov-14-2025

The art and science of Quranic recitation (Tajweed), a discipline governed by meticulous phonetic, rhythmic, and theological principles, confronts substantial educational challenges in today's digital age. Although modern technology offers unparalleled opportunities for learning, existing automated systems for evaluating recitation have struggled to gain broad acceptance or demonstrate educational effectiveness. This literature review examines this crucial disparity, offering a thorough analysis of scholarly research, digital platforms, and commercial tools developed over the past twenty years. Our analysis uncovers a fundamental flaw in current approaches that adapt Automatic Speech Recognition (ASR) systems, which emphasize word identification over qualitative acoustic evaluation. These systems suffer from limitations such as reliance on biased datasets, demographic disparities, and an inability to deliver meaningful feedback for improvement. Challenging these data-centric methodologies, we advocate for a paradigm shift toward a knowledge-based computational framework. By leveraging the unchanging nature of the Quranic text and the well-defined rules of Tajweed, we propose that an effective evaluation system should be built upon rule-based acoustic modeling centered on canonical pronunciation principles and articulation points (Makhraj), rather than depending on statistical patterns derived from flawed or biased data. The review concludes that the future of automated Quranic recitation assessment lies in hybrid systems that combine linguistic expertise with advanced audio processing. Such an approach paves the way for developing reliable, fair, and pedagogically effective tools that can authentically assist learners across the globe.

data mining, machine learning, natural language, (16 more...)

2510.12858

Country:

Asia > Middle East (0.92)
Africa > Middle East > Egypt (0.46)

Genre:

Instructional Material (0.93)
Overview (0.88)
Research Report > New Finding (0.67)
Research Report > Promising Solution (0.46)

Industry:

Education > Educational Setting > Online (0.93)
Education > Educational Technology > Educational Software > Computer Based Training (0.68)
Information Technology > Security & Privacy (0.67)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (1.00)
(4 more...)

arXiv.org Artificial Intelligence, Nov-11-2025

E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis

Zhang, Zhisheng, Wang, Derui, Mi, Yifan, Wu, Zhiyong, Gao, Jie, Cao, Yuxin, Ye, Kai, Xue, Minhui, Hao, Jie

Recent advancements in speech synthesis technology have enriched our daily lives, with high-quality and human-like audio widely adopted across real-world applications. However, malicious exploitation like voice-cloning fraud poses severe security risks. Existing defense techniques struggle to address the production large language model (LLM)-based speech synthesis. While previous studies have considered the protection for fine-tuning synthesizers, they assume manually annotated transcripts. Given the labor intensity of manual annotation, end-to-end (E2E) systems leveraging automatic speech recognition (ASR) to generate transcripts are becoming increasingly prevalent, e.g., voice cloning via commercial APIs. Therefore, this E2E speech synthesis also requires new security mechanisms. To tackle these challenges, we propose E2E-VGuard, a proactive defense framework for two emerging threats: (1) production LLM-based speech synthesis, and (2) the novel attack arising from ASR-driven E2E scenarios. Specifically, we employ the encoder ensemble with a feature extractor to protect timbre, while ASR-targeted adversarial examples disrupt pronunciation. Moreover, we incorporate the psychoacoustic model to ensure perturbative imperceptibility. For a comprehensive evaluation, we test 16 open-source synthesizers and 3 commercial APIs across Chinese and English datasets, confirming E2E-VGuard's effectiveness in timbre and pronunciation protection. Real-world deployment validation is also conducted. Our code and demo page are available at https://wxzyd123.github.io/e2e-vguard/.

e2e-vguard, large language model, machine learning, (21 more...)

2511.07099

Country: Asia > China (0.46)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Kotoge, Rikuto, Sasaki, Yuichi

Data-efficient Targeted Token-level Preference Optimization for LLM-based Text-to-Speech

arXiv.org Artificial Intelligence, Oct-8-2025

Aligning text-to-speech (TTS) system outputs with human feedback through preference optimization has been shown to effectively improve the robustness and naturalness of language model-based TTS models. Current approaches primarily require paired desirable and undesirable samples at the utterance level. However, such pairs are often limited in TTS output data, and utterance-level formulation prevents fine-grained token-level optimization needed for accurate pronunciation alignment. In this study, we propose TKTO that eliminates the need for paired data, enabling a more data-efficient training paradigm, and directly targets token-level units, automatically providing fine-grained alignment signals without token-level annotations. TKTO improves the challenging Japanese TTS accuracy by 39% and reduces CER by 54%, automatically assigning 12.8 times stronger reward to targeted tokens.

large language model, machine learning, natural language, (17 more...)

2510.05799

Country: Asia > Japan > Honshū (0.14)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)

arXiv.org Artificial Intelligence, Sep-25-2025

OLaPh: Optimal Language Phonemizer

Wirth, Johannes

Phonemization, the conversion of text into phonemes, is a key step in text-to-speech. Traditional approaches use rule-based transformations and lexicon lookups, while more advanced methods apply preprocessing techniques or neural networks for improved accuracy on out-of-domain vocabulary. However, all systems struggle with names, loanwords, abbreviations, and homographs. This work presents OLaPh (Optimal Language Phonemizer), a framework that combines large lexica, multiple NLP techniques, and compound resolution with a probabilistic scoring function. Evaluations in German and English show improved accuracy over previous approaches, including on a challenging dataset. To further address unresolved cases, we train a large language model on OLaPh-generated data, which achieves even stronger generalization and performance. Together, the framework and LLM improve phonemization consistency and provide a freely available resource for future research.

large language model, machine learning, phonemization, (18 more...)

2509.20086

Country:

Europe > Netherlands (0.14)
Europe > Czechia (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.74)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.49)

arXiv.org Artificial Intelligence, Sep-24-2025

UtterTune: LoRA-Based Target-Language Pronunciation Edit and Control in Multilingual Text-to-Speech

Kato, Shuhei

ABSTRACT: We propose UtterTune, a lightweight adaptation method that fine-tunes a multilingual text-to-speech (TTS) system based on a large language model (LLM) architecture, designed to enhance the controllability of pronunciation in a target language while preserving performance in others. While LLM architectures have enabled TTS models to achieve remarkable naturalness, accurately modeling grapheme-to-phoneme (G2P) mapping and prosody remains challenging, especially when the model omits an explicit G2P module and directly processes minimally encoded text (e.g., byte-pair encoding). UtterTune leverages low-rank adaptation to enable the control of segmental pronunciation and pitch accent at the phoneme level for Japanese speech, the target language in this paper, while maintaining naturalness and speaker similarity in a zero-shot setting. Objective and subjective evaluations confirm its effectiveness. Index Terms: text-to-speech, large language model, low-rank adaptation, pronunciation, controllability. 1. INTRODUCTION: Text-to-speech (TTS) models based on large language model (LLM) architecture (LLM-TTS in this paper) have demonstrated exceptional naturalness, especially in zero-shot multi-speaker and multilingual synthesis, leading the way in speech synthesis technology [1, 2, 3, 4]; however, reproducing accurate pronunciation remains challenging. Some multilingual LLM-TTS, such as CosyVoice 2 [4], are designed to take raw text (characters) as input and tokenize it via byte-pair encoding (BPE) [5], without explicit phonemic or prosody markers. This design contrasts with conventional neural sequence-to-sequence TTS, which typically converts input text into phonemes (grapheme-to-phoneme; G2P) and prosody information, if needed, using a text frontend before feeding it into the model [6, 7, 8, 1].
On the other hand, such models require a large amount of speech-text pairs that cover the diversity of the target language because they predict segmental pronunciation and prosody in a data-driven manner.

artificial intelligence, large language model, natural language, (15 more...)

2508.09767

Country: Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.16)

Genre: Research Report > New Finding (0.47)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

The Guardian, Sep-16-2025, 04:00:27 GMT

'I love you too!' My family's creepy, unsettling week with an AI toy

'Let's talk about something fun!' Grem the AI chatbot toy. The cuddly chatbot Grem is designed to 'learn' your child's personality, while every conversation they have is recorded, then transcribed by a third party. It wasn't long before I wanted this experiment to be over ... "I'm going to throw that thing into a river!" my wife says as she comes down the stairs looking frazzled after putting our four-year-old daughter to bed. To be clear, "that thing" is not our daughter, Emma*. It's Grem, an AI-powered stuffed alien toy that the musician Claire Boucher, better known as Grimes, helped develop with toy company Curio. Designed for kids aged three and over and built with OpenAI's technology, the toy is supposed to "learn" your child's personality and have fun, educational conversations with them. It's advertised as a healthier alternative to screen time and is ...

chatbot, grem, view image, (14 more...)

The Guardian

Country:

North America > United States (0.29)
Europe > United Kingdom (0.15)
Europe > Ukraine (0.05)
(5 more...)

Genre: Personal (0.69)

Industry:

Government > Regional Government (0.70)
Leisure & Entertainment > Sports (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)