speech dataset
EuroSpeech: A Multilingual Speech Corpus
Pfisterer, Samuel, Grรถtschla, Florian, Lanzendรถrfer, Luca A., Yan, Florian, Wattenhofer, Roger
Recent progress in speech processing has highlighted that high-quality performance across languages requires substantial training data for each individual language. While existing multilingual datasets cover many languages, they often contain insufficient data for most languages. Thus, trained models perform poorly on the majority of the supported languages. Our work addresses this challenge by introducing a scalable pipeline for constructing speech datasets from parliamentary recordings. The proposed pipeline includes robust components for media retrieval and a two-stage alignment algorithm designed to handle non-verbatim transcripts and long-form audio. Applying this pipeline to recordings from 22 European parliaments, we extract over 61k hours of aligned speech segments, achieving substantial per-language coverage with 19 languages exceeding 1k hours and 22 languages exceeding 500 hours of high-quality speech data. We obtain an average 41.8\% reduction in word error rates over baselines when finetuning an existing ASR model on our dataset, demonstrating the usefulness of our approach.
ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis
Kalahroodi, Mohammad Javad Ranjbar, Faili, Heshaam, Shakery, Azadeh
Existing Persian speech datasets are typically smaller than their English counterparts, which creates a key limitation for developing Persian speech technologies. We address this gap by introducing ParsVoice, the largest Persian speech corpus designed specifically for text-to-speech(TTS) applications. We created an automated pipeline that transforms raw audiobook content into TTS-ready data, incorporating components such as a BERT-based sentence completion detector, a binary search boundary optimization method for precise audio-text alignment, and audio-text quality assessment frameworks tailored to Persian. The pipeline processes 2,000 audiobooks, yielding 3,526 hours of clean speech, which was further filtered into a 1,804-hour high-quality subset suitable for TTS, featuring more than 470 speakers. To validate the dataset, we fine-tuned XTTS for Persian, achieving a naturalness Mean Opinion Score (MOS) of 3.6/5 and a Speaker Similarity Mean Opinion Score (SMOS) of 4.0/5 demonstrating ParsVoice's effectiveness for training multi-speaker TTS systems. ParsVoice is the largest high-quality Persian speech dataset, offering speaker diversity and audio quality comparable to major English corpora. The complete dataset has been made publicly available to accelerate the development of Persian speech technologies. The ParsVoice dataset is publicly available at: https://huggingface.co/datasets/MohammadJRanjbar/ParsVoice.
Bridging Gaps in Hate Speech Detection: Meta-Collections and Benchmarks for Low-Resource Iberian Languages
Piot, Paloma, Campos, Josรฉ Ramom Pichel, Parapar, Javier
Hate speech poses a serious threat to social cohesion and individual well-being, particularly on social media, where it spreads rapidly. While research on hate speech detection has progressed, it remains largely focused on English, resulting in limited resources and benchmarks for low-resource languages. Moreover, many of these languages have multiple linguistic varieties, a factor often overlooked in current approaches. At the same time, large language models require substantial amounts of data to perform reliably, a requirement that low-resource languages often cannot meet. In this work, we address these gaps by compiling a meta-collection of hate speech datasets for European Spanish, standardised with unified labels and metadata. This collection is based on a systematic analysis and integration of existing resources, aiming to bridge the data gap and support more consistent and scalable hate speech detection. We extended this collection by translating it into European Portuguese and into a Galician standard that is more convergent with Spanish and another Galician variant that is more convergent with Portuguese, creating aligned multilingual corpora. Using these resources, we establish new benchmarks for hate speech detection in Iberian languages. We evaluate state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings, providing baseline results for future research. Moreover, we perform a cross-lingual analysis with our target languages. Our findings underscore the importance of multilingual and variety-aware approaches in hate speech detection and offer a foundation for improved benchmarking in underrepresented European languages.
RV-HATE: Reinforced Multi-Module Voting for Implicit Hate Speech Detection
Lee, Yejin, Ahn, Hyeseon, Han, Yo-Sub
Hate speech remains prevalent in human society and continues to evolve in its forms and expressions. Modern advancements in internet and online anonymity accelerate its rapid spread and complicate its detection. However, hate speech datasets exhibit diverse characteristics primarily because they are constructed from different sources and platforms, each reflecting different linguistic styles and social contexts. Despite this diversity, prior studies on hate speech detection often rely on fixed methodologies without adapting to data-specific features. We introduce RV-HATE, a detection framework designed to account for the dataset-specific characteristics of each hate speech dataset. RV-HATE consists of multiple specialized modules, where each module focuses on distinct linguistic or contextual features of hate speech. The framework employs reinforcement learning to optimize weights that determine the contribution of each module for a given dataset. A voting mechanism then aggregates the module outputs to produce the final decision. RV-HATE offers two primary advantages: (1)~it improves detection accuracy by tailoring the detection process to dataset-specific attributes, and (2)~it also provides interpretable insights into the distinctive features of each dataset. Consequently, our approach effectively addresses implicit hate speech and achieves superior performance compared to conventional static methods. Our code is available at https://github.com/leeyejin1231/RV-HATE.
Automatic Speech Recognition (ASR) for African Low-Resource Languages: A Systematic Literature Review
Imam, Sukairaj Hafiz, Belay, Tadesse Destaw, Husse, Kedir Yassin, Ahmad, Ibrahim Said, Abdulmumin, Idris, Umar, Hadiza Ali, Bello, Muhammad Yahuza, Nakatumba-Nabende, Joyce, Yimam, Seid Muhie, Muhammad, Shamsuddeen Hassan
ASR has achieved remarkable global progress, yet African low-resource languages remain rigorously underrepresented, producing barriers to digital inclusion across the continent with more than +2000 languages. This systematic literature review (SLR) explores research on ASR for African languages with a focus on datasets, models and training methods, evaluation techniques, challenges, and recommends future directions. We employ the PRISMA 2020 procedures and search DBLP, ACM Digital Library, Google Scholar, Semantic Scholar, and arXiv for studies published between January 2020 and July 2025. We include studies related to ASR datasets, models or metrics for African languages, while excluding non-African, duplicates, and low-quality studies (score <3/5). We screen 71 out of 2,062 records and we record a total of 74 datasets across 111 languages, encompassing approximately 11,206 hours of speech. Fewer than 15% of research provided reproducible materials, and dataset licensing is not clear. Self-supervised and transfer learning techniques are promising, but are hindered by limited pre-training data, inadequate coverage of dialects, and the availability of resources. Most of the researchers use Word Error Rate (WER), with very minimal use of linguistically informed scores such as Character Error Rate (CER) or Diacritic Error Rate (DER), and thus with limited application in tonal and morphologically rich languages. The existing evidence on ASR systems is inconsistent, hindered by issues like dataset availability, poor annotations, licensing uncertainties, and limited benchmarking. Nevertheless, the rise of community-driven initiatives and methodological advancements indicates a pathway for improvement. Sustainable development for this area will also include stakeholder partnership, creation of ethically well-balanced datasets, use of lightweight modelling techniques, and active benchmarking.
Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages
Li, Chin-Jou, Yeo, Eunjung, Choi, Kwanghee, Pรฉrez-Toro, Paula Andrea, Someki, Masao, Das, Rohan Kumar, Yue, Zhengjun, Orozco-Arroyave, Juan Rafael, Nรถth, Elmar, Mortensen, David R.
Automatic speech recognition (ASR) for dysarthric speech remains challenging due to data scarcity, particularly in non-English languages. To address this, we fine-tune a voice conversion model on English dysarthric speech (UASpeech) to encode both speaker characteristics and prosodic distortions, then apply it to convert healthy non-English speech (FLEURS) into non-English dysarthric-like speech. The generated data is then used to fine-tune a multilingual ASR model, Massively Multilingual Speech (MMS), for improved dysarthric speech recognition. Evaluation on PC-GIT A (Spanish), EasyCall (Italian), and SSNCE (Tamil) demonstrates that VC with both speaker and prosody conversion significantly outperforms the off-the-shelf MMS performance and conventional augmentation techniques such as speed and tempo perturbation. Objective and subjective analyses of the generated data further confirm that the generated speech simulates dysarthric characteristics.
AUDDT: Audio Unified Deepfake Detection Benchmark Toolkit
Zhu, Yi, Guimarรฃes, Heitor R., Pimentel, Arthur, Falk, Tiago
With the prevalence of artificial intelligence (AI)-generated content, such as audio deepfakes, a large body of recent work has focused on developing deepfake detection techniques. However, most models are evaluated on a narrow set of datasets, leaving their generalization to real-world conditions uncertain. In this paper, we systematically review 28 existing audio deepfake datasets and present an open-source benchmarking toolkit called AUDDT (https://github.com/MuSAELab/AUDDT). The goal of this toolkit is to automate the evaluation of pretrained detectors across these 28 datasets, giving users direct feedback on the advantages and shortcomings of their deepfake detectors. We start by showcasing the usage of the developed toolkit, the composition of our benchmark, and the breakdown of different deepfake subgroups. Next, using a widely adopted pretrained deepfake detector, we present in- and out-of-domain detection results, revealing notable differences across conditions and audio manipulation types. Lastly, we also analyze the limitations of these existing datasets and their gap relative to practical deployment scenarios.
MNV-17: A High-Quality Performative Mandarin Dataset for Nonverbal Vocalization Recognition in Speech
Mai, Jialong, Ji, Jinxin, Xing, Xiaofen, Yang, Chen, Chen, Weidong, Xing, Jingyuan, Xu, Xiangmin
Mainstream Automatic Speech Recognition (ASR) systems excel at transcribing lexical content, but largely fail to recognize nonverbal vocalizations (NVs) embedded in speech, such as sighs, laughs, and coughs. This capability is important for a comprehensive understanding of human communication, as NVs convey crucial emotional and intentional cues. Progress in NV-aware ASR has been hindered by the lack of high-quality, well-annotated datasets. To address this gap, we introduce MNV-17, a 7.55-hour performative Mandarin speech dataset. Unlike most existing corpora that rely on model-based detection, MNV-17's performative nature ensures high-fidelity, clearly articulated NV instances. To the best of our knowledge, MNV-17 provides the most extensive set of nonverbal vocalization categories, comprising 17 distinct and well-balanced classes of common NVs. We benchmarked MNV-17 on four mainstream ASR architectures, evaluating their joint performance on semantic transcription and NV classification. The dataset and the pretrained model checkpoints will be made publicly available to facilitate future research in expressive ASR.
UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech
Liu, Jiaxuan, Xiang, Yang, Zhao, Han, Li, Xiangang, Gao, Yingying, Zhang, Shilei, Ling, Zhenhua
Recent large language models (LLMs) have made great progress in the field of text-to-speech (TTS), but they still face major challenges in synthesizing fine-grained emotional speech in an interpretable manner. Traditional methods rely on discrete emotion labels to control emotion categories and intensities, which cannot capture the complexity and continuity of human emotional perception and expression. The lack of large-scale emotional speech datasets with balanced emotion distributions and fine-grained emotional annotations often causes overfitting in synthesis models and impedes effective emotion control. To address these issues, we propose UDDETTS, a universal LLM framework unifying discrete and dimensional emotions for controllable emotional TTS. This model introduces the interpretable Arousal-Dominance-V alence (ADV) space for dimensional emotion description and supports emotion control driven by either discrete emotion labels or nonlinearly quantified ADV values. Furthermore, a semi-supervised training strategy is designed to comprehensively utilize diverse speech datasets with different types of emotional annotations to train the UDDETTS. Experiments show that UDDETTS achieves linear emotion control along three interpretable dimensions, and exhibits superior end-to-end emotional speech synthesis capabilities. Code and demos are available at: https://anonymous.4open.science/ It is designed for large-scale emotional speech datasets and integrates discrete label and dimensional ADV annotations to enable controllable emotional TTS. These models leverage the strong language understanding of LLMs to generate speech semantic tokens from text tokens, thereby achieving significant advantages in synthesizing expressive speech. Current LLM-based methods primarily rely on emotion prompts for supervised fine-tuning.