AITopics | synthetic speech

Forensic deepfake audio detection using segmental speech features

Yang, Tianle, Sun, Chengzhe, Lyu, Siwei, Rose, Phil

arXiv.org Artificial Intelligence, Dec-12-2025

This study explores the potential of using acoustic features of segmental speech sounds to detect deepfake audio. These features are highly interpretable because of their close relationship with human articulatory processes and are expected to be more difficult for deepfake models to replicate. The results demonstrate that certain segmental features commonly used in forensic voice comparison (FVC) are effective in identifying deepfakes, whereas some global features provide little value. These findings underscore the need to approach audio deepfake detection using methods that are distinct from those employed in traditional FVC, and offer a new perspective on leveraging segmental features for this purpose. In addition, the present study proposes a speaker-specific framework for deepfake detection, which differs fundamentally from the speaker-independent systems that dominate current benchmarks. While speaker-independent frameworks aim at broad generalization, the speaker-specific approach offers advantages in forensic contexts where case-by-case interpretability and sensitivity to individual phonetic realization are essential.

artificial intelligence, detection, machine learning, (17 more...)

doi: 10.1016/j.forsciint.2025.112768

arXiv: 2505.13847

Country: North America > United States (0.93)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

HuLA: Prosody-Aware Anti-Spoofing with Multi-Task Learning for Expressive and Emotional Synthetic Speech

Mahapatra, Aurosweta, Ulgen, Ismail Rasim, Sisman, Berrak

arXiv.org Artificial Intelligence, Sep-29-2025

Current anti-spoofing systems remain vulnerable to expressive and emotional synthetic speech, since they rarely leverage prosody as a discriminative cue. In this paper, we propose HuLA, a two-stage prosody-aware multi-task learning framework for spoof detection. In Stage 2, the model is jointly optimized for spoof detection and prosody tasks on both real and synthetic data, leveraging prosodic awareness to detect mismatches between natural and expressive synthetic speech. Experiments show that HuLA consistently outperforms strong baselines on challenging out-of-domain datasets, including expressive, emotional, and cross-lingual attacks. These results demonstrate that explicit prosodic supervision, combined with SSL embeddings, substantially improves robustness against advanced synthetic speech attacks.

Anti-spoofing aims to detect audio generated through replay attacks, speech synthesis, and voice conversion (VC) [1]. Recent progress in text-to-speech (TTS) [2]-[6] and VC systems [7]-[11] has amplified concerns about expressive synthetic speech, which can be misused to compromise biometric authentication or impersonate speakers for spreading misinformation [12], [13]. One of the goals of speech generation is to produce speech that is natural and indistinguishable from human speech. Expressiveness and emotion are the defining characteristics of human speech. While this limitation is a weakness for synthesis, it represents a valuable opportunity for anti-spoofing: imperfect expressive [...]

A. Mahapatra is with the Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD 21218 USA (e-mail: amahapa2@jhu.edu). Ismail Rasim Ulgen is with the Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD 21218 USA (e-mail: iulgen1@jhu.edu).

artificial intelligence, machine learning, speech, (16 more...)

arXiv: 2509.21676

Country:

North America > United States > Maryland > Baltimore (0.44)
North America > United States > California (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset

Yan, Brian, Hamed, Injy, Shimizu, Shuichiro, Lodagala, Vasista, Chen, William, Iakovenko, Olga, Talafha, Bashar, Hussein, Amir, Polok, Alexander, Chang, Kalvin, Klement, Dominik, Althubaiti, Sara, Peng, Puyuan, Wiesner, Matthew, Solorio, Thamar, Ali, Ahmed, Khudanpur, Sanjeev, Watanabe, Shinji, Chen, Chih-Chen, Wu, Zhen, Benharrak, Karim, Diwan, Anuj, Cornell, Samuele, Yeo, Eunjung, Choi, Kwanghee, Carvalho, Carlos, Rosero, Karen

arXiv.org Artificial Intelligence, Sep-18-2025

CS-FLEURS consists of 4 test sets which cover in total 113 unique code-switched language pairs across 52 languages: 1) a 14 X-English language pair set with real voices reading synthetically generated code-switched sentences, 2) a 16 X-English language pair set with generative text-to-speech, 3) a 60 {Arabic, Mandarin, Hindi, Spanish}-X language pair set with generative text-to-speech, and 4) a 45 X-English lower-resourced language pair test set with concatenative text-to-speech. Besides the four test sets, CS-FLEURS also provides a training set with 128 hours of generative text-to-speech data across 16 X-English language pairs. Our hope is that CS-FLEURS helps to broaden the scope of future code-switched speech research.

large language model, machine learning, natural language, (22 more...)

arXiv: 2509.14161

Country:

Asia (0.68)
Europe (0.46)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.48)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.47)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.46)

HISPASpoof: A New Dataset For Spanish Speech Forensics

Risques, Maria, Bhagtani, Kratika, Yadav, Amit Kumar Singh, Delp, Edward J.

arXiv.org Artificial Intelligence, Sep-12-2025

Zero-shot Voice Cloning (VC) and Text-to-Speech (TTS) methods have advanced rapidly, enabling the generation of highly realistic synthetic speech and raising serious concerns about their misuse. While numerous detectors have been developed for English and Chinese, Spanish, spoken by over 600 million people worldwide, remains underrepresented in speech forensics. To address this gap, we introduce HISPASpoof, the first large-scale Spanish dataset designed for synthetic speech detection and attribution. It includes real speech from public corpora across six accents and synthetic speech generated with six zero-shot TTS systems. We evaluate five representative methods, showing that detectors trained on English fail to generalize to Spanish, while training on HISPASpoof substantially improves detection. We also evaluate synthetic speech attribution performance on HISPASpoof, i.e., identifying the generation method of synthetic speech. HISPASpoof thus provides a critical benchmark for advancing reliable and inclusive speech forensics in Spanish.

The rapid advancement of speech synthesis techniques has significantly transformed the area of audio generation and speech forensics. Recent Text-to-Speech (TTS) and Voice Cloning (VC) methods [1], [2], [3], [4], [5], [6] are now capable of producing highly realistic synthetic voices that closely mimic the spectral, prosodic, and linguistic traits of real human speech [7], [8], [9], [10].

large language model, machine learning, natural language, (19 more...)

arXiv: 2509.09155

Country:

North America > United States > California (0.68)
North America > United States > Indiana > Tippecanoe County > West Lafayette (0.24)
North America > United States > Indiana > Tippecanoe County > Lafayette (0.24)

Genre: Research Report > New Finding (0.68)

Industry:

Media (0.68)
Information Technology > Security & Privacy (0.47)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
(2 more...)

Whilter: A Whisper-based Data Filter for "In-the-Wild" Speech Corpora Using Utterance-level Multi-Task Classification

Ravenscroft, William, Close, George, Bower-Morris, Kit, Stacey, Jamie, Sityaev, Dmitry, Hong, Kris Y.

arXiv.org Artificial Intelligence, Aug-26-2025

Large-scale in-the-wild speech datasets have become more prevalent in recent years due to increased interest in models that can learn useful features from unlabelled data for tasks such as speech recognition or synthesis. These datasets often contain undesirable features, such as multiple speakers, non-target languages, and music, which may impact model learning. The Whilter model is proposed as a multitask solution to identify these undesirable samples. Whilter uses a Whisper encoder with an attention-based classifier to solve five diverse classification problems at once. In addition, an annotated dataset is published for a subset of two popular in-the-wild corpora. Whilter achieves F1 scores above 85% and equal error rates of 6.5% to 7.8% for three of five subtasks, outperforming a state-of-the-art BEATs classifier on speech-specific classes, with a notable decrease in processing time compared to a combination of single-task alternatives.
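The abstract above reports F1 scores and equal error rates (EER). As a rough illustration of what those two numbers measure, the following sketch computes both from classifier scores in plain Python. It is not taken from the Whilter paper: the function names, the threshold sweep over observed scores, and the toy data are all assumptions made for this example.

```python
def equal_error_rate(scores, labels):
    """Return (eer, threshold) for binary scores, where label 1 is the positive class."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    best_gap, best_eer, best_thr = None, None, None
    # Sweep every observed score as a candidate decision threshold.
    for thr in sorted(set(scores)):
        # False rejection rate: positives scored below the threshold.
        frr = sum(s < thr for s in positives) / len(positives)
        # False acceptance rate: negatives scored at or above the threshold.
        far = sum(s >= thr for s in negatives) / len(negatives)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Report the midpoint of FAR and FRR where they are closest.
            best_gap, best_eer, best_thr = gap, (far + frr) / 2, thr
    return best_eer, best_thr


def f1_score(predictions, labels):
    """Harmonic mean of precision and recall for binary predictions."""
    tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy scores for one hypothetical subtask (for example, "contains music").
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
eer, threshold = equal_error_rate(scores, labels)
predictions = [1 if s >= threshold else 0 for s in scores]
print(eer, threshold, f1_score(predictions, labels))
```

A lower EER and a higher F1 both indicate a better classifier; production toolkits usually interpolate the ROC curve rather than sweeping observed scores, so their EER values can differ slightly from this sketch.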

artificial intelligence, machine learning, speech recognition, (20 more...)

arXiv: 2507.21642

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Improving French Synthetic Speech Quality via SSML Prosody Control

Ouali, Nassima Ould, Sani, Awais Hussain, Bueno, Ruben, Dauvet, Jonah, Horstmann, Tim Luka, Moulines, Eric

arXiv.org Artificial Intelligence, Aug-26-2025

Despite recent advances, synthetic voices often lack expressiveness due to limited prosody control in commercial text-to-speech (TTS) systems. We introduce the first end-to-end pipeline that inserts Speech Synthesis Markup Language (SSML) tags into French text to control pitch, speaking rate, volume, and pause duration. We employ a cascaded architecture with two QLoRA-fine-tuned Qwen 2.5-7B models: one predicts phrase-break positions and the other performs regression on prosodic targets, generating commercial TTS-compatible SSML markup. Evaluated on a 14-hour French podcast corpus, our method achieves 99.2% F1 for break placement and reduces mean absolute error on pitch, rate, and volume by 25-40% compared with prompting-only large language models (LLMs) and a BiLSTM baseline. In perceptual evaluation involving 18 participants across over 9 hours of synthesized audio, SSML-enhanced speech generated by our pipeline significantly improves naturalness, with the mean opinion score increasing from 3.20 to 3.87 (p < 0.005). Additionally, 15 of 18 listeners preferred our enhanced synthesis. These results demonstrate substantial progress in bridging the expressiveness gap between synthetic and natural French speech. Our code is publicly available at https://github.com/hi-paris/Prosody-Control-French-TTS.
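To make the idea of SSML prosody control concrete, here is a minimal sketch of the kind of markup the abstract above describes: `<prosody>` elements for pitch, rate, and volume, and `<break>` elements for pauses, as defined in the W3C SSML specification. It is not the authors' pipeline. The helper names, the phrase dictionary layout, and all numeric values are hypothetical, and a given commercial TTS engine may require additional attributes on `<speak>`.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody_span(text, pitch="+0%", rate="100%", volume="+0dB"):
    """Wrap text in an SSML <prosody> element with the given settings."""
    attrs = (
        f"pitch={quoteattr(pitch)} "
        f"rate={quoteattr(rate)} "
        f"volume={quoteattr(volume)}"
    )
    # escape() protects characters such as & and < inside the spoken text.
    return f"<prosody {attrs}>{escape(text)}</prosody>"


def build_ssml(phrases, lang="fr-FR"):
    """Join phrases into one <speak> document, adding a pause after a phrase when requested."""
    parts = []
    for phrase in phrases:
        parts.append(
            prosody_span(
                phrase["text"],
                phrase.get("pitch", "+0%"),
                phrase.get("rate", "100%"),
                phrase.get("volume", "+0dB"),
            )
        )
        if phrase.get("break_ms"):
            parts.append(f'<break time="{int(phrase["break_ms"])}ms"/>')
    return f'<speak xml:lang="{lang}">' + "".join(parts) + "</speak>"


# Hypothetical predictions for two phrases of a French sentence.
phrases = [
    {"text": "Bonjour à tous,", "pitch": "+5%", "rate": "95%", "volume": "+0dB", "break_ms": 300},
    {"text": "bienvenue dans cet épisode.", "pitch": "-2%", "rate": "100%", "volume": "+1dB"},
]
print(build_ssml(phrases))
```

In a cascaded setup like the one described, one model would decide where the phrase boundaries and pauses go and another would supply the pitch, rate, and volume values; this sketch only covers the final serialization step.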

large language model, machine learning, natural language, (20 more...)

arXiv: 2508.17494

Country: North America > Canada (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems

Minixhofer, Christoph, Klejch, Ondrej, Bell, Peter

arXiv.org Artificial Intelligence, Jun-25-2025

Evaluation of Text to Speech (TTS) systems is challenging and resource-intensive. Subjective metrics such as Mean Opinion Score (MOS) are not easily comparable between works. Objective metrics are frequently used, but rarely validated against subjective ones. Both kinds of metrics are challenged by recent TTS systems capable of producing synthetic speech indistinguishable from real speech. In this work, we introduce Text to Speech Distribution Score 2 (TTSDS2), a more robust and improved version of TTSDS. Across a range of domains and languages, it is the only one out of 16 compared metrics to correlate with a Spearman correlation above 0.50 for every domain and subjective score evaluated. We also release a range of resources for evaluating synthetic speech close to real speech: A dataset with over 11,000 subjective opinion score ratings; a pipeline for continually recreating a multilingual test dataset to avoid data leakage; and a continually updated benchmark for TTS in 14 languages.
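The abstract above validates an objective metric by its Spearman correlation with subjective scores. The sketch below shows how that rank correlation can be computed in plain Python, using average ranks for ties. It illustrates the statistic only, not the TTSDS2 benchmark code; the function names and the example scores are made up for this purpose.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-system objective scores and listener MOS values.
objective = [0.61, 0.72, 0.55, 0.90, 0.80]
mos = [3.1, 3.6, 2.9, 4.5, 4.4]
print(spearman(objective, mos))
```

Because only ranks matter, the objective metric and the subjective score can be on entirely different scales; a value above 0.50 means the metric orders systems broadly the same way listeners do.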

artificial intelligence, correlation, machine learning, (21 more...)

arXiv: 2506.19441

Country: Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.81)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Can Emotion Fool Anti-spoofing?

Mahapatra, Aurosweta, Ulgen, Ismail Rasim, Naini, Abinay Reddy, Busso, Carlos, Sisman, Berrak

arXiv.org Artificial Intelligence, Jun-2-2025

Traditional anti-spoofing focuses on models and datasets built on synthetic speech with mostly neutral state, neglecting diverse emotional variations. As a result, their robustness against high-quality, emotionally expressive synthetic speech is uncertain. We address this by introducing EmoSpoof-TTS, a corpus of emotional text-to-speech samples. Our analysis shows existing anti-spoofing models struggle with emotional synthetic speech, exposing risks of emotion-targeted attacks. Even trained on emotional data, the models underperform due to limited focus on emotional aspect and show performance disparities across emotions. This highlights the need for emotion-focused anti-spoofing paradigm in both dataset and methodology. We propose GEM, a gated ensemble of emotion-specialized models with a speech emotion recognition gating network. GEM performs effectively across all emotions and neutral state, improving defenses against spoofing attacks.
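One way to picture a gated ensemble of emotion-specialized models is as a weighted blend of per-emotion spoof scores, with the weights coming from an emotion recognizer. The sketch below shows that general pattern only. The abstract does not say whether GEM uses soft or hard gating or how scores are combined, so the softmax weighting, the function names, and the numbers here are all assumptions.

```python
import math


def softmax(logits):
    """Convert raw gate logits into weights that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_spoof_score(emotion_logits, expert_scores):
    """Blend per-emotion expert spoof scores using the gate's emotion weights."""
    if len(emotion_logits) != len(expert_scores):
        raise ValueError("need one expert score per emotion logit")
    weights = softmax(emotion_logits)
    return sum(w * s for w, s in zip(weights, expert_scores))


# Hypothetical gate output over (neutral, happy, angry, sad) and the spoof
# probability reported by each emotion-specialized detector.
emotion_logits = [0.2, 2.5, 0.1, -1.0]
expert_scores = [0.35, 0.92, 0.40, 0.30]
print(gated_spoof_score(emotion_logits, expert_scores))
```

With this soft weighting, the detector specialized for the most likely emotion dominates the final score while the others still contribute; a hard gate would instead pick only the top-scoring emotion's detector.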

artificial intelligence, machine learning, natural language, (19 more...)

arXiv: 2505.23962

Country: North America > United States (0.15)

Genre: Research Report > New Finding (0.93)

Industry: Information Technology > Security & Privacy (0.49)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Emotion (0.67)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.51)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition

Wang, Tianduo, Xu, Lu, Lu, Wei, Cheng, Shanbo

arXiv.org Artificial Intelligence, May-23-2025

Recent advances in Automatic Speech Recognition (ASR) have been largely fueled by massive speech corpora. However, extending coverage to diverse languages with limited resources remains a formidable challenge. This paper introduces Speech Back-Translation, a scalable pipeline that improves multilingual ASR models by converting large-scale text corpora into synthetic speech via off-the-shelf text-to-speech (TTS) models. We demonstrate that just tens of hours of real transcribed speech can effectively train TTS models to generate synthetic speech at hundreds of times the original volume while maintaining high quality. To evaluate synthetic speech quality, we develop an intelligibility-based assessment framework and establish clear thresholds for when synthetic data benefits ASR training. Using Speech Back-Translation, we generate more than 500,000 hours of synthetic speech in ten languages and continue pre-training Whisper-large-v3, achieving average transcription error reductions of over 30%. These results highlight the scalability and effectiveness of Speech Back-Translation for enhancing multilingual ASR systems.
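An intelligibility check on synthetic speech can be sketched as follows: transcribe each synthesized utterance with an ASR system, compare the transcript to the source text, and keep only utterances below an error threshold. The abstract does not specify the metric or the thresholds, so the use of word error rate (WER), the `max_wer` value of 0.2, and the function names below are assumptions for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def filter_intelligible(pairs, max_wer=0.2):
    """Keep (source_text, asr_transcript) pairs whose WER is at most max_wer."""
    return [(text, hyp) for text, hyp in pairs
            if word_error_rate(text, hyp) <= max_wer]


# Hypothetical source texts paired with ASR transcripts of their TTS audio.
pairs = [
    ("bonjour tout le monde", "bonjour tout le monde"),
    ("merci beaucoup", "mercy bocu"),
]
print(filter_intelligible(pairs))
```

In practice the text would be normalized (casing, punctuation, numerals) before scoring, and the threshold would be tuned per language.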

artificial intelligence, machine learning, natural language, (20 more...)

arXiv: 2505.16972

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

APG-MOS: Auditory Perception Guided-MOS Predictor for Synthetic Speech

Lian, Zhicheng, Wang, Lizhi, Huang, Hua

arXiv.org Artificial Intelligence, Apr-30-2025

Automatic speech quality assessment aims to quantify subjective human perception of speech through computational models to reduce the need for labor-consuming manual evaluations. While models based on deep learning have achieved progress in predicting mean opinion scores (MOS) to assess synthetic speech, the neglect of fundamental auditory perception mechanisms limits consistency with human judgments. To address this issue, we propose an auditory perception guided-MOS prediction model (APG-MOS) that synergistically integrates auditory modeling with semantic analysis to enhance consistency with human judgments. Specifically, we first design a perceptual module, grounded in biological auditory mechanisms, to simulate cochlear functions, which encodes acoustic signals into biologically aligned electrochemical representations. Secondly, we propose a residual vector quantization (RVQ)-based semantic distortion modeling method to quantify the degradation of speech quality at the semantic level. Finally, we design a residual cross-attention architecture, coupled with a progressive learning strategy, to enable multimodal fusion of encoded electrochemical signals and semantic representations. Experiments demonstrate that APG-MOS achieves superior performance on two primary benchmarks. Our code and checkpoint will be available on a public repository upon publication.

artificial intelligence, auditory perception guided-MOS predictor, synthetic speech, (1 more...)

arXiv: 2504.20447

Genre: Research Report (0.40)

Industry: Health & Medicine > Consumer Health (0.80)

Technology: Information Technology > Artificial Intelligence (0.73)
