human speech
The State Of TTS: A Case Study with Human Fooling Rates
Varadhan, Praveen Srinivasa, Thomas, Sherry, S., Sai Teja M., Bhooshan, Suvrat, Khapra, Mitesh M.
While subjective evaluations in recent years indicate rapid progress in TTS, can current TTS systems truly pass a human deception test in a Turing-like evaluation? We introduce Human Fooling Rate (HFR), a metric that directly measures how often machine-generated speech is mistaken for human. Our large-scale evaluation of open-source and commercial TTS models reveals critical insights: (i) CMOS-based claims of human parity often fail under deception testing; (ii) TTS progress should be benchmarked on datasets where human speech achieves high HFRs, as evaluating against monotonous or less expressive reference samples sets a low bar; (iii) commercial models approach human deception in zero-shot settings, while open-source systems still struggle with natural conversational speech; and (iv) fine-tuning on high-quality data improves realism but does not fully bridge the gap. Our findings underscore the need for more realistic, human-centric evaluations alongside existing subjective tests.
- Asia > India (0.15)
- North America > United States (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
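Since the abstract defines HFR operationally (the share of machine-generated clips that listeners label as human), the scoring itself is simple to sketch. The snippet below is a minimal, hypothetical illustration; the data layout and system names are invented, not taken from the paper.

```python
# Hypothetical sketch: computing a Human Fooling Rate (HFR) from
# binary listener judgments. Field names and data layout are assumed,
# not taken from the paper.
from collections import defaultdict

# Each record: (system, judged_human), where judged_human is True when
# a listener labeled that system's clip as human speech.
judgments = [
    ("tts_a", True), ("tts_a", False), ("tts_a", True),
    ("tts_b", False), ("tts_b", False), ("tts_b", True),
    ("human", True), ("human", True), ("human", False),
]

counts = defaultdict(lambda: [0, 0])  # system -> [fooled, total]
for system, judged_human in judgments:
    counts[system][0] += judged_human
    counts[system][1] += 1

for system, (fooled, total) in counts.items():
    print(f"{system}: HFR = {fooled / total:.2f}")
```

Per point (ii) above, the same statistic computed on genuine human recordings sets the ceiling against which system HFRs should be read.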
Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese
Wang, Xihuai, Zhao, Ziyi, Ren, Siyu, Zhang, Shao, Li, Song, Li, Xiaoyu, Wang, Ziwen, Qiu, Lin, Wan, Guanglu, Cao, Xuezhi, Cai, Xunliang, Zhang, Weinan
Recent advances in large language models (LLMs) have significantly improved text-to-speech (TTS) systems, enhancing control over speech style, naturalness, and emotional expression and bringing TTS systems closer to human-level performance. Although the Mean Opinion Score (MOS) remains the standard for TTS evaluation, it suffers from subjectivity, environmental inconsistencies, and limited interpretability. Existing evaluation datasets also lack a multi-dimensional design, often neglecting factors such as speaking styles, context diversity, and trap utterances, a gap particularly evident in Chinese TTS evaluation. To address these challenges, we introduce the Audio Turing Test (ATT), a multi-dimensional Chinese corpus (ATT-Corpus) paired with a simple, Turing-Test-inspired evaluation protocol. Instead of relying on complex MOS scales or direct model comparisons, ATT asks evaluators to judge whether a voice sounds human. This simplification reduces rating bias and improves evaluation robustness. To further support rapid model development, we also fine-tune Qwen2-Audio-Instruct on human judgment data as Auto-ATT for automatic evaluation. Experimental results show that ATT effectively differentiates models across specific capability dimensions using its multi-dimensional design. Auto-ATT also demonstrates strong alignment with human evaluations, confirming its value as a fast and reliable assessment tool. The white-box ATT-Corpus and Auto-ATT can be found in the ATT Hugging Face Collection (https://huggingface.co/collections/meituan/audio-turing-test-682446320368164faeaf38a4).
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Information Technology > Artificial Intelligence > Speech (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Issues > Turing's Test (1.00)
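The protocol's trap utterances suggest a simple quality gate before aggregating the binary human/machine votes. Below is one plausible, purely illustrative way such a gate could work; the field names, data, and pass criterion are assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch of an ATT-style aggregation: drop raters who fail
# trap utterances, then score by the fraction of clips judged "human".
# Data layout and the strict pass criterion are assumptions.

ratings = [
    # (rater, clip, is_trap, truth_is_human, judged_human)
    ("r1", "c1", True,  False, False),   # r1 catches the trap
    ("r1", "c2", False, None,  True),
    ("r2", "c1", True,  False, True),    # r2 misses the trap
    ("r2", "c2", False, None,  True),
]

# Keep only raters who answer every trap utterance correctly.
failed = {r for r, _, trap, truth, judged in ratings
          if trap and judged != truth}
kept = [judged for r, _, trap, _, judged in ratings
        if not trap and r not in failed]

print(f"share judged human: {sum(kept) / len(kept):.2f}")
```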
Comparing Self-Supervised Learning Models Pre-Trained on Human Speech and Animal Vocalizations for Bioacoustics Processing
Sarkar, Eklavya, Magimai-Doss, Mathew
Self-supervised learning (SSL) foundation models have emerged as powerful, domain-agnostic, general-purpose feature extractors applicable to a wide range of tasks. Such models pre-trained on human speech have demonstrated high transferability for bioacoustic processing. This paper investigates (i) whether SSL models pre-trained directly on animal vocalizations offer a significant advantage over those pre-trained on speech, and (ii) whether fine-tuning speech-pretrained models on automatic speech recognition (ASR) tasks can enhance bioacoustic classification. We conduct a comparative analysis using three diverse bioacoustic datasets and two different bioacoustic tasks. Results indicate that pre-training on bioacoustic data provides only marginal improvements over speech-pretrained models, with comparable performance in most scenarios. Fine-tuning on ASR tasks yields mixed outcomes, suggesting that the general-purpose representations learned during SSL pre-training are already well-suited for bioacoustic tasks. These findings highlight the robustness of speech-pretrained SSL models for bioacoustics and imply that extensive fine-tuning may not be necessary for optimal performance.
- North America > United States (0.14)
- South America > Colombia > Meta Department > Villavicencio (0.04)
- Europe > Switzerland > Vaud > Lausanne (0.04)
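Several entries in this digest (this one, the marmoset study, and the dog-bark paper below) share the same recipe: freeze a speech-pretrained SSL model, mean-pool its frame embeddings, and fit a light classifier. The sketch below illustrates that recipe with wav2vec 2.0 as an assumed checkpoint; none of these choices are claimed to match the authors' exact setups.

```python
# Sketch of the shared recipe: frozen speech-pretrained SSL features +
# a light classifier. wav2vec 2.0 is an illustrative choice here, not
# necessarily the checkpoint any of these papers used.
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model
from sklearn.linear_model import LogisticRegression

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def embed(waveform_16khz):
    """Mean-pool the last hidden layer into one vector per clip."""
    inputs = extractor(waveform_16khz, sampling_rate=16_000,
                       return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, T, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

# X: list of 1-D numpy waveforms at 16 kHz; y: call-type labels.
# clf = LogisticRegression(max_iter=1000).fit([embed(w) for w in X], y)
```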
Is a Chat with a Bot a Conversation?
You are at the Princess's ball, and she is telling you a secret, but her orchestra of bears is making such a fearful lot of noise you cannot hear what she is saying. What do you say, dear? I'd lean in closer and say, "Could you repeat that? The bear-itone section is a bit too enthusiastic tonight!" In 1958, the year the illustrated children's book "What Do You Say, Dear?" appeared, the leaders of a field newly dubbed "artificial intelligence" spoke at a conference in Teddington, England, on "The Mechanisation of Thought Processes." Marvin Minsky, of M.I.T., talked about heuristic programming; Alan Turing gave a paper called "Learning Machines"; Grace Hopper assessed the state of computer languages; and scientists from Bell Labs débuted a computer that could synthesize human speech by having it sing "Daisy Bell" ("Daisy, Daisy, give me your answer, do . . .").
- Europe > United Kingdom > England (0.24)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Europe > France (0.04)
Spectral oversubtraction? An approach for speech enhancement after robot ego speech filtering in semi-real-time
Li, Yue, Hindriks, Koen V., Kunneman, Florian A.
Spectral subtraction, widely used for its simplicity, has been employed for the Robot Ego Speech Filtering (RESF) problem: detecting the speech content of human interruptions in the robot's single-channel microphone recordings while the robot itself is speaking. However, this approach suffers from oversubtraction in the fundamental frequency range (FFR), which degrades speech content recognition. To address this, we propose a Two-Mask Conformer-based Metric Generative Adversarial Network (CMGAN) to enhance the detected speech and improve recognition results. Our model compensates for oversubtracted FFR values with high-frequency information and long-term features, and then de-noises the new spectrogram. In addition, we introduce an incremental processing method that allows semi-real-time audio processing with streaming input on a network trained on long fixed-length input. Evaluations on two datasets, including one with unseen noise, demonstrate significant improvements in recognition accuracy and confirm the effectiveness of the proposed two-mask approach and incremental processing, enhancing the robustness of the proposed RESF pipeline in real-world HRI scenarios.
- Europe > Netherlands > North Holland > Amsterdam (0.05)
- Oceania > Australia > Queensland (0.04)
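For context, textbook spectral subtraction with an oversubtraction factor makes the failure mode above easy to see: with alpha > 1, weak harmonics in the fundamental frequency range get pushed down to the spectral floor. This is a generic sketch of the textbook method, not the authors' pipeline.

```python
# Textbook spectral subtraction with an oversubtraction factor alpha.
# With alpha > 1, weak harmonics (e.g., in the fundamental frequency
# range) can be driven to the spectral floor -- the oversubtraction
# problem the paper addresses. Generic sketch, not the authors' code.
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(noisy, noise_est, fs, alpha=2.0, floor=0.01):
    f, t, X = stft(noisy, fs=fs, nperseg=512)
    _, _, N = stft(noise_est, fs=fs, nperseg=512)
    noise_mag = np.abs(N).mean(axis=1, keepdims=True)  # avg noise spectrum
    mag = np.abs(X) - alpha * noise_mag                # oversubtraction
    mag = np.maximum(mag, floor * np.abs(X))           # spectral floor
    _, clean = istft(mag * np.exp(1j * np.angle(X)), fs=fs, nperseg=512)
    return clean
```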
On the Utility of Speech and Audio Foundation Models for Marmoset Call Analysis
Sarkar, Eklavya, Magimai-Doss, Mathew
Marmoset monkeys encode vital information in their calls and serve as a surrogate model for neuro-biologists to understand the evolutionary origins of human vocal communication. Traditionally analyzed with signal processing-based features, recent approaches have utilized self-supervised models pre-trained on human speech for feature extraction, capitalizing on their ability to learn a signal's intrinsic structure independently of its acoustic domain. However, the utility of such foundation models remains unclear for marmoset call analysis in terms of multi-class classification, bandwidth, and pre-training domain. This study assesses feature representations derived from speech and general audio domains, across pre-training bandwidths of 4, 8, and 16 kHz, for marmoset call-type and caller classification tasks. Results show that models pre-trained at higher bandwidths perform better, and that pre-training on speech or general audio yields comparable results, with both improving over a spectral baseline.
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Europe > Switzerland > Vaud > Lausanne (0.04)
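On the input side, the bandwidth comparison above reduces to resampling each call before feature extraction. A small sketch with torchaudio follows; the mapping from bandwidth to sample rate (via the Nyquist frequency) is my assumption for illustration.

```python
# Sketch: preparing audio at the three bandwidths compared in the
# paper by resampling before feature extraction. The bandwidth-to-
# sample-rate pairing is an assumption for illustration.
import torchaudio

def at_bandwidth(waveform, orig_sr, bandwidth_hz):
    """Resample so the Nyquist frequency matches the target bandwidth."""
    target_sr = 2 * bandwidth_hz  # e.g., 8 kHz bandwidth -> 16 kHz rate
    return torchaudio.functional.resample(waveform, orig_sr, target_sr)

# wav, sr = torchaudio.load("marmoset_call.wav")
# variants = {bw: at_bandwidth(wav, sr, bw) for bw in (4_000, 8_000, 16_000)}
```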
Towards Dog Bark Decoding: Leveraging Human Speech Processing for Automated Bark Classification
Abzaliev, Artem, Pérez Espinosa, Humberto, Mihalcea, Rada
Similar to humans, animals make extensive use of verbal and non-verbal forms of communication, including a large range of audio signals. In this paper, we address dog vocalizations and explore the use of self-supervised speech representation models pre-trained on human speech to address dog bark classification tasks that find parallels in human-centered tasks in speech recognition. We specifically address four tasks: dog recognition, breed identification, gender classification, and context grounding. We show that using speech embedding representations significantly improves over simpler classification baselines. Further, we also find that models pre-trained on large human speech acoustics can provide additional performance boosts on several tasks.
- North America > United States > Michigan (0.04)
- North America > Mexico > Tlaxcala (0.04)
- North America > Mexico > Puebla > Puebla (0.04)
AI Scam Calls: How to Protect Yourself, How to Detect
You answer a random call from a family member, and they breathlessly explain how there's been a horrible car accident. They need you to send money right now, or they'll go to jail. You can hear the desperation in their voice as they plead for an immediate cash transfer. While it sure sounds like them, and the call came from their number, you feel like something's off. So, you decide to hang up and call them right back.
Single-Channel Robot Ego-Speech Filtering during Human-Robot Interaction
Li, Yue, Hindriks, Koen V., Kunneman, Florian A.
In this paper, we study how well human speech can be automatically separated out when it overlaps with the voice and fan noise of a social robot, Pepper. We ultimately aim for an HRI scenario in which the microphone can remain open while the robot is speaking, enabling a more natural turn-taking scheme where the human can interrupt the robot. To respond appropriately, the robot would need to understand what the interlocutor said in the overlapping part of the speech, which can be accomplished by target speech extraction (TSE). To investigate how well TSE can be accomplished in the context of the popular social robot Pepper, we constructed a dataset composed of a mixture of Pepper's own recorded speech, its fan noise (which is close to the microphones), and human speech as recorded by the Pepper microphone, in rooms with both low and high reverberation. Comparing a signal processing approach (with and without post-filtering) and a convolutional recurrent neural network (CRNN) approach against a state-of-the-art speaker-identification-based TSE model, we found that the signal processing approach without post-filtering yielded the best Word Error Rate on overlapping speech with low reverberation, while the CRNN approach is more robust to reverberation. These results show that estimating the human voice in speech overlapping with a robot's is possible in real-life applications, provided that the room reverberation is low and the human speech has high volume or high pitch.
- North America > United States > Colorado > Boulder County > Boulder (0.15)
- Europe > Netherlands > North Holland > Amsterdam (0.05)
- Asia > Middle East > Israel (0.04)
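The study's headline comparison is Word Error Rate on the recovered overlapping speech. The snippet below sketches that measurement with the jiwer package; the transcripts are made up, not from the paper's data.

```python
# Sketch: scoring a target-speech-extraction pipeline by Word Error
# Rate, as in the paper's evaluation. Transcripts here are invented.
from jiwer import wer

reference = "could you repeat the last instruction please"
hypothesis = "could you repeat the last induction"  # ASR on enhanced audio

# WER = (substitutions + deletions + insertions) / reference words
print(f"WER = {wer(reference, hypothesis):.2f}")
```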
ISPA: Inter-Species Phonetic Alphabet for Transcribing Animal Sounds
Hagiwara, Masato, Miron, Marius, Liu, Jen-Yu
Traditionally, bioacoustics has relied on spectrograms and continuous, per-frame audio representations for the analysis of animal sounds, which also serve as input to machine learning models. Meanwhile, the International Phonetic Alphabet (IPA) has provided an interpretable, language-independent method for transcribing human speech sounds. In this paper, we introduce ISPA (Inter-Species Phonetic Alphabet), a precise, concise, and interpretable system designed for transcribing animal sounds into text. We compare acoustics-based and feature-based methods for transcribing and classifying animal sounds, demonstrating performance comparable to baseline methods that use continuous, dense audio representations. By representing animal sounds as text, we effectively treat them as a "foreign language," and we show that established human-language ML paradigms and models, such as language models, can be successfully applied to improve performance.
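The "foreign language" framing means that once calls are rendered as ISPA strings, call classification becomes ordinary text classification. A minimal sketch follows; the transcriptions and labels are invented placeholders, not real ISPA output.

```python
# Sketch: once animal sounds are ISPA text, standard text-ML applies.
# The transcriptions and labels below are invented placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

transcriptions = ["ki ki krr aa", "krr krr aa", "ki aa aa", "krr ki krr"]
labels = ["alarm", "contact", "alarm", "contact"]

# Character n-grams stand in for the subword units a language model
# would consume; any text classifier could be swapped in here.
clf = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),
    MultinomialNB(),
).fit(transcriptions, labels)

print(clf.predict(["ki ki aa"]))
```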