human speech
The State Of TTS: A Case Study with Human Fooling Rates
Varadhan, Praveen Srinivasa, Thomas, Sherry, S., Sai Teja M., Bhooshan, Suvrat, Khapra, Mitesh M.
While subjective evaluations in recent years indicate rapid progress in TTS, can current TTS systems truly pass a human deception test in a Turing-like evaluation? We introduce Human Fooling Rate (HFR), a metric that directly measures how often machine-generated speech is mistaken for human. Our large-scale evaluation of open-source and commercial TTS models reveals critical insights: (i) CMOS-based claims of human parity often fail under deception testing; (ii) TTS progress should be benchmarked on datasets where human speech achieves high HFRs, as evaluating against monotonous or less expressive reference samples sets a low bar; (iii) commercial models approach human deception in zero-shot settings, while open-source systems still struggle with natural conversational speech; and (iv) fine-tuning on high-quality data improves realism but does not fully bridge the gap. Our findings underscore the need for more realistic, human-centric evaluations alongside existing subjective tests.
- Asia > India (0.15)
- North America > United States (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
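Since the abstract defines HFR operationally (the share of machine-generated clips that listeners label as human), the scoring itself is simple to sketch. The snippet below is a minimal, hypothetical illustration; the data layout and system names are invented, not taken from the paper.

```python
# Hypothetical sketch: computing a Human Fooling Rate (HFR) from
# binary listener judgments. Field names and data layout are assumed,
# not taken from the paper.
from collections import defaultdict

# Each record: (system, judged_human), where judged_human is True when
# a listener labeled that system's clip as human speech.
judgments = [
    ("tts_a", True), ("tts_a", False), ("tts_a", True),
    ("tts_b", False), ("tts_b", False), ("tts_b", True),
    ("human", True), ("human", True), ("human", False),
]

counts = defaultdict(lambda: [0, 0])  # system -> [fooled, total]
for system, judged_human in judgments:
    counts[system][0] += judged_human
    counts[system][1] += 1

for system, (fooled, total) in counts.items():
    print(f"{system}: HFR = {fooled / total:.2f}")
```

Per point (ii) above, the same statistic computed on genuine human recordings sets the ceiling against which system HFRs should be read.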
Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese
Wang, Xihuai, Zhao, Ziyi, Ren, Siyu, Zhang, Shao, Li, Song, Li, Xiaoyu, Wang, Ziwen, Qiu, Lin, Wan, Guanglu, Cao, Xuezhi, Cai, Xunliang, Zhang, Weinan
Recent advances in large language models (LLMs) have significantly improved text-to-speech (TTS) systems, enhancing control over speech style, naturalness, and emotional expression and bringing TTS systems closer to human-level performance. Although the Mean Opinion Score (MOS) remains the standard for TTS evaluation, it suffers from subjectivity, environmental inconsistencies, and limited interpretability. Existing evaluation datasets also lack a multi-dimensional design, often neglecting factors such as speaking styles, context diversity, and trap utterances, a gap particularly evident in Chinese TTS evaluation. To address these challenges, we introduce the Audio Turing Test (ATT), a multi-dimensional Chinese corpus (ATT-Corpus) paired with a simple, Turing-Test-inspired evaluation protocol. Instead of relying on complex MOS scales or direct model comparisons, ATT asks evaluators to judge whether a voice sounds human. This simplification reduces rating bias and improves evaluation robustness. To further support rapid model development, we also fine-tune Qwen2-Audio-Instruct on human judgment data as Auto-ATT for automatic evaluation. Experimental results show that ATT effectively differentiates models across specific capability dimensions using its multi-dimensional design. Auto-ATT also demonstrates strong alignment with human evaluations, confirming its value as a fast and reliable assessment tool. The white-box ATT-Corpus and Auto-ATT can be found in the ATT Hugging Face Collection (https://huggingface.co/collections/meituan/audio-turing-test-682446320368164faeaf38a4).
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Information Technology > Artificial Intelligence > Speech (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Issues > Turing's Test (1.00)
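The protocol's trap utterances suggest a simple quality gate before aggregating the binary human/machine votes. Below is one plausible, purely illustrative way such a gate could work; the field names, data, and pass criterion are assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch of an ATT-style aggregation: drop raters who fail
# trap utterances, then score by the fraction of clips judged "human".
# Data layout and the strict pass criterion are assumptions.

ratings = [
    # (rater, clip, is_trap, truth_is_human, judged_human)
    ("r1", "c1", True,  False, False),   # r1 catches the trap
    ("r1", "c2", False, None,  True),
    ("r2", "c1", True,  False, True),    # r2 misses the trap
    ("r2", "c2", False, None,  True),
]

# Keep only raters who answer every trap utterance correctly.
failed = {r for r, _, trap, truth, judged in ratings
          if trap and judged != truth}
kept = [judged for r, _, trap, _, judged in ratings
        if not trap and r not in failed]

print(f"share judged human: {sum(kept) / len(kept):.2f}")
```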
Comparing Self-Supervised Learning Models Pre-Trained on Human Speech and Animal Vocalizations for Bioacoustics Processing
Sarkar, Eklavya, Magimai-Doss, Mathew
Self-supervised learning (SSL) foundation models have emerged as powerful, domain-agnostic, general-purpose feature extractors applicable to a wide range of tasks. Such models pre-trained on human speech have demonstrated high transferability for bioacoustic processing. This paper investigates (i) whether SSL models pre-trained directly on animal vocalizations offer a significant advantage over those pre-trained on speech, and (ii) whether fine-tuning speech-pretrained models on automatic speech recognition (ASR) tasks can enhance bioacoustic classification. We conduct a comparative analysis using three diverse bioacoustic datasets and two different bioacoustic tasks. Results indicate that pre-training on bioacoustic data provides only marginal improvements over speech-pretrained models, with comparable performance in most scenarios. Fine-tuning on ASR tasks yields mixed outcomes, suggesting that the general-purpose representations learned during SSL pre-training are already well-suited for bioacoustic tasks. These findings highlight the robustness of speech-pretrained SSL models for bioacoustics and imply that extensive fine-tuning may not be necessary for optimal performance.
- North America > United States (0.14)
- South America > Colombia > Meta Department > Villavicencio (0.04)
- Europe > Switzerland > Vaud > Lausanne (0.04)
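Several entries in this digest (this one, the marmoset study, and the dog-bark paper below) share the same recipe: freeze a speech-pretrained SSL model, mean-pool its frame embeddings, and fit a light classifier. The sketch below illustrates that recipe with wav2vec 2.0 as an assumed checkpoint; none of these choices are claimed to match the authors' exact setups.

```python
# Sketch of the shared recipe: frozen speech-pretrained SSL features +
# a light classifier. wav2vec 2.0 is an illustrative choice here, not
# necessarily the checkpoint any of these papers used.
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model
from sklearn.linear_model import LogisticRegression

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def embed(waveform_16khz):
    """Mean-pool the last hidden layer into one vector per clip."""
    inputs = extractor(waveform_16khz, sampling_rate=16_000,
                       return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, T, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

# X: list of 1-D numpy waveforms at 16 kHz; y: call-type labels.
# clf = LogisticRegression(max_iter=1000).fit([embed(w) for w in X], y)
```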
Is a Chat with a Bot a Conversation?
You are at the Princess's ball, and she is telling you a secret, but her orchestra of bears is making such a fearful lot of noise you cannot hear what she is saying. What do you say, dear? I'd lean in closer and say, "Could you repeat that? The bear-itone section is a bit too enthusiastic tonight!" In 1958, the year the illustrated children's book "What Do You Say, Dear?" appeared, the leaders of a field newly dubbed "artificial intelligence" spoke at a conference in Teddington, England, on "The Mechanisation of Thought Processes." Marvin Minsky, of M.I.T., talked about heuristic programming; Alan Turing gave a paper called "Learning Machines"; Grace Hopper assessed the state of computer languages; and scientists from Bell Labs débuted a computer that could synthesize human speech by having it sing "Daisy Bell" ("Daisy, Daisy, give me your answer, do . . .").
- Europe > United Kingdom > England (0.24)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Europe > France (0.04)
Spectral oversubtraction? An approach for speech enhancement after robot ego speech filtering in semi-real-time
Li, Yue, Hindriks, Koen V., Kunneman, Florian A.
Spectral subtraction, widely used for its simplicity, has been employed for the Robot Ego Speech Filtering (RESF) problem: detecting the speech content of human interruptions in the robot's single-channel microphone recordings while the robot itself is speaking. However, this approach suffers from oversubtraction in the fundamental frequency range (FFR), which degrades speech content recognition. To address this, we propose a Two-Mask Conformer-based Metric Generative Adversarial Network (CMGAN) to enhance the detected speech and improve recognition results. Our model compensates for oversubtracted FFR values with high-frequency information and long-term features, and then de-noises the new spectrogram. In addition, we introduce an incremental processing method that allows semi-real-time audio processing with streaming input on a network trained on long fixed-length input. Evaluations on two datasets, including one with unseen noise, demonstrate significant improvements in recognition accuracy and confirm the effectiveness of the proposed two-mask approach and incremental processing, enhancing the robustness of the proposed RESF pipeline in real-world HRI scenarios.
- Europe > Netherlands > North Holland > Amsterdam (0.05)
- Oceania > Australia > Queensland (0.04)
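For context, textbook spectral subtraction with an oversubtraction factor makes the failure mode above easy to see: with alpha > 1, weak harmonics in the fundamental frequency range get pushed down to the spectral floor. This is a generic sketch of the textbook method, not the authors' pipeline.

```python
# Textbook spectral subtraction with an oversubtraction factor alpha.
# With alpha > 1, weak harmonics (e.g., in the fundamental frequency
# range) can be driven to the spectral floor -- the oversubtraction
# problem the paper addresses. Generic sketch, not the authors' code.
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(noisy, noise_est, fs, alpha=2.0, floor=0.01):
    f, t, X = stft(noisy, fs=fs, nperseg=512)
    _, _, N = stft(noise_est, fs=fs, nperseg=512)
    noise_mag = np.abs(N).mean(axis=1, keepdims=True)  # avg noise spectrum
    mag = np.abs(X) - alpha * noise_mag                # oversubtraction
    mag = np.maximum(mag, floor * np.abs(X))           # spectral floor
    _, clean = istft(mag * np.exp(1j * np.angle(X)), fs=fs, nperseg=512)
    return clean
```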
On the Utility of Speech and Audio Foundation Models for Marmoset Call Analysis
Sarkar, Eklavya, Magimai-Doss, Mathew
Marmoset monkeys encode vital information in their calls and serve as a surrogate model for neuro-biologists to understand the evolutionary origins of human vocal communication. Traditionally analyzed with signal processing-based features, recent approaches have utilized self-supervised models pre-trained on human speech for feature extraction, capitalizing on their ability to learn a signal's intrinsic structure independently of its acoustic domain. However, the utility of such foundation models remains unclear for marmoset call analysis in terms of multi-class classification, bandwidth, and pre-training domain. This study assesses feature representations derived from speech and general audio domains, across pre-training bandwidths of 4, 8, and 16 kHz, for marmoset call-type and caller classification tasks. Results show that models pre-trained at higher bandwidths perform better, and that pre-training on speech or general audio yields comparable results, with both improving over a spectral baseline.
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Europe > Switzerland > Vaud > Lausanne (0.04)
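On the input side, the bandwidth comparison above reduces to resampling each call before feature extraction. A small sketch with torchaudio follows; the mapping from bandwidth to sample rate (via the Nyquist frequency) is my assumption for illustration.

```python
# Sketch: preparing audio at the three bandwidths compared in the
# paper by resampling before feature extraction. The bandwidth-to-
# sample-rate pairing is an assumption for illustration.
import torchaudio

def at_bandwidth(waveform, orig_sr, bandwidth_hz):
    """Resample so the Nyquist frequency matches the target bandwidth."""
    target_sr = 2 * bandwidth_hz  # e.g., 8 kHz bandwidth -> 16 kHz rate
    return torchaudio.functional.resample(waveform, orig_sr, target_sr)

# wav, sr = torchaudio.load("marmoset_call.wav")
# variants = {bw: at_bandwidth(wav, sr, bw) for bw in (4_000, 8_000, 16_000)}
```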
Towards Dog Bark Decoding: Leveraging Human Speech Processing for Automated Bark Classification
Abzaliev, Artem, Pérez Espinosa, Humberto, Mihalcea, Rada
Similar to humans, animals make extensive use of verbal and non-verbal forms of communication, including a large range of audio signals. In this paper, we address dog vocalizations and explore the use of self-supervised speech representation models pre-trained on human speech to address dog bark classification tasks that find parallels in human-centered tasks in speech recognition. We specifically address four tasks: dog recognition, breed identification, gender classification, and context grounding. We show that using speech embedding representations significantly improves over simpler classification baselines. Further, we also find that models pre-trained on large human speech acoustics can provide additional performance boosts on several tasks.
- North America > United States > Michigan (0.04)
- North America > Mexico > Tlaxcala (0.04)
- North America > Mexico > Puebla > Puebla (0.04)
AI Scam Calls: How to Protect Yourself, How to Detect
You answer a random call from a family member, and they breathlessly explain how there's been a horrible car accident. They need you to send money right now, or they'll go to jail. You can hear the desperation in their voice as they plead for an immediate cash transfer. While it sure sounds like them, and the call came from their number, you feel like something's off. So, you decide to hang up and call them right back.
Single-Channel Robot Ego-Speech Filtering during Human-Robot Interaction
Li, Yue, Hindriks, Koen V., Kunneman, Florian A.
In this paper, we study how well human speech can be automatically separated out when it overlaps with the voice and fan noise of a social robot, Pepper. We ultimately aim for an HRI scenario in which the microphone can remain open while the robot is speaking, enabling a more natural turn-taking scheme where the human can interrupt the robot. To respond appropriately, the robot would need to understand what the interlocutor said in the overlapping part of the speech, which can be accomplished by target speech extraction (TSE). To investigate how well TSE can be accomplished in the context of the popular social robot Pepper, we constructed a dataset composed of a mixture of Pepper's own recorded speech, its fan noise (which is close to the microphones), and human speech as recorded by the Pepper microphone, in rooms with both low and high reverberation. Comparing a signal processing approach (with and without post-filtering) and a convolutional recurrent neural network (CRNN) approach against a state-of-the-art speaker-identification-based TSE model, we found that the signal processing approach without post-filtering yielded the best Word Error Rate on overlapping speech with low reverberation, while the CRNN approach is more robust to reverberation. These results show that estimating the human voice in speech overlapping with a robot's is possible in real-life applications, provided that the room reverberation is low and the human speech has high volume or high pitch.
- North America > United States > Colorado > Boulder County > Boulder (0.15)
- Europe > Netherlands > North Holland > Amsterdam (0.05)
- Asia > Middle East > Israel (0.04)
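The study's headline comparison is Word Error Rate on the recovered overlapping speech. The snippet below sketches that measurement with the jiwer package; the transcripts are made up, not from the paper's data.

```python
# Sketch: scoring a target-speech-extraction pipeline by Word Error
# Rate, as in the paper's evaluation. Transcripts here are invented.
from jiwer import wer

reference = "could you repeat the last instruction please"
hypothesis = "could you repeat the last induction"  # ASR on enhanced audio

# WER = (substitutions + deletions + insertions) / reference words
print(f"WER = {wer(reference, hypothesis):.2f}")
```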
ISPA: Inter-Species Phonetic Alphabet for Transcribing Animal Sounds
Hagiwara, Masato, Miron, Marius, Liu, Jen-Yu
Traditionally, bioacoustics has relied on spectrograms and continuous, per-frame audio representations for the analysis of animal sounds, which also serve as input to machine learning models. Meanwhile, the International Phonetic Alphabet (IPA) has provided an interpretable, language-independent method for transcribing human speech sounds. In this paper, we introduce ISPA (Inter-Species Phonetic Alphabet), a precise, concise, and interpretable system designed for transcribing animal sounds into text. We compare acoustics-based and feature-based methods for transcribing and classifying animal sounds, demonstrating performance comparable to baseline methods that use continuous, dense audio representations. By representing animal sounds as text, we effectively treat them as a "foreign language," and we show that established human-language ML paradigms and models, such as language models, can be successfully applied to improve performance.
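The "foreign language" framing means that once calls are rendered as ISPA strings, call classification becomes ordinary text classification. A minimal sketch follows; the transcriptions and labels are invented placeholders, not real ISPA output.

```python
# Sketch: once animal sounds are ISPA text, standard text-ML applies.
# The transcriptions and labels below are invented placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

transcriptions = ["ki ki krr aa", "krr krr aa", "ki aa aa", "krr ki krr"]
labels = ["alarm", "contact", "alarm", "contact"]

# Character n-grams stand in for the subword units a language model
# would consume; any text classifier could be swapped in here.
clf = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),
    MultinomialNB(),
).fit(transcriptions, labels)

print(clf.predict(["ki ki aa"]))
```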