AITopics | speech perception

LibriBrain: Over 50 Hours of Within-Subject MEG to Improve Speech Decoding Methods at Scale

Neural Information Processing Systems, Jun-15-2026, 02:41:59 GMT

LibriBrain represents the largest single-subject MEG dataset to date for speech decoding, with over 50 hours of recordings -- 5× larger than the next comparable dataset and 50× larger than most. This unprecedented 'depth' of within-subject data enables exploration of neural representations at a scale previously unavailable with non-invasive methods. LibriBrain comprises high-quality MEG recordings together with detailed annotations from a single participant listening to naturalistic spoken English, covering nearly the full Sherlock Holmes canon. Designed to support advances in neural decoding, LibriBrain comes with a Python library for streamlined integration with deep learning frameworks, standard data splits for reproducibility, and baseline results for three foundational decoding tasks: speech detection, phoneme classification, and word classification. Baseline experiments demonstrate that increasing training data yields substantial improvements in decoding performance, highlighting the value of scaling up deep, within-subject datasets. By releasing this dataset, we aim to empower the research community to advance speech decoding methodologies and accelerate the development of safe, effective clinical brain-computer interfaces.

artificial intelligence, machine learning, natural language, (20 more...)

Country: North America > United States (0.67)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Health Care Technology (1.00)
Health & Medicine > Diagnostic Medicine (0.93)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Aligning Brain Signals with Multimodal Speech and Vision Embeddings

Shapovalenko, Kateryna, Auster, Quentin

arXiv.org Artificial Intelligence, Nov-11-2025

When we hear the word "house", we don't just process sound, we imagine walls, doors, memories. The brain builds meaning through layers, moving from raw acoustics to rich, multimodal associations. Inspired by this, we build on recent work from Meta that aligned EEG signals with averaged wav2vec2 speech embeddings, and ask a deeper question: which layers of pre-trained models best reflect this layered processing in the brain? We compare embeddings from two models: wav2vec2, which encodes sound into language, and CLIP, which maps words to images. Using EEG recorded during natural speech perception, we evaluate how these embeddings align with brain activity using ridge regression and contrastive decoding. We test three strategies: individual layers, progressive concatenation, and progressive summation. The findings suggest that combining multimodal, layer-aware representations may bring us closer to decoding how the brain understands language, not just as sound, but as experience.
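
The three layer-combination strategies named in the abstract can be illustrated with a minimal sketch. The abstract does not define them precisely, so the definitions here are assumptions: "individual" uses one layer alone, "progressive concatenation" joins layers 0..k along the feature axis, and "progressive summation" adds layers 0..k element-wise. The function names and the toy numbers are hypothetical and are not taken from the paper.

```python
from typing import List

# One matrix per model layer: rows are time samples, columns are embedding dimensions.
Matrix = List[List[float]]


def individual(layers: List[Matrix], k: int) -> Matrix:
    """Use layer k on its own (returns a copy)."""
    return [row[:] for row in layers[k]]


def progressive_concat(layers: List[Matrix], k: int) -> Matrix:
    """Concatenate layers 0..k along the feature axis (assumed definition)."""
    n_rows = len(layers[0])
    return [
        [value for layer in layers[: k + 1] for value in layer[i]]
        for i in range(n_rows)
    ]


def progressive_sum(layers: List[Matrix], k: int) -> Matrix:
    """Sum layers 0..k element-wise; all layers must share one width (assumed definition)."""
    n_rows, n_cols = len(layers[0]), len(layers[0][0])
    return [
        [sum(layer[i][j] for layer in layers[: k + 1]) for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Toy input: 3 layers, 2 samples, 2 dimensions each (made-up numbers).
layers = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[10.0, 20.0], [30.0, 40.0]],
    [[100.0, 200.0], [300.0, 400.0]],
]

print(individual(layers, 1))
print(progressive_concat(layers, 1))
print(progressive_sum(layers, 2))
```

Under these assumptions, each strategy yields a feature matrix that could then be passed to a regression or contrastive decoder; concatenation grows the feature width with k, while summation keeps it fixed.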

artificial intelligence, machine learning, representation, (16 more...)

arXiv:2511.00065

Genre: Research Report > New Finding (0.48)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.36)

LibriBrain: Over 50 Hours of Within-Subject MEG to Improve Speech Decoding Methods at Scale

Özdogan, Miran, Landau, Gilad, Elvers, Gereon, Jayalath, Dulhan, Somaiya, Pratik, Mantegna, Francesco, Woolrich, Mark, Jones, Oiwi Parker

arXiv.org Artificial Intelligence, Jun-4-2025

LibriBrain represents the largest single-subject MEG dataset to date for speech decoding, with over 50 hours of recordings -- 5× larger than the next comparable dataset and 50× larger than most. This unprecedented 'depth' of within-subject data enables exploration of neural representations at a scale previously unavailable with non-invasive methods. LibriBrain comprises high-quality MEG recordings together with detailed annotations from a single participant listening to naturalistic spoken English, covering nearly the full Sherlock Holmes canon. Designed to support advances in neural decoding, LibriBrain comes with a Python library for streamlined integration with deep learning frameworks, standard data splits for reproducibility, and baseline results for three foundational decoding tasks: speech detection, phoneme classification, and word classification. Baseline experiments demonstrate that increasing training data yields substantial improvements in decoding performance, highlighting the value of scaling up deep, within-subject datasets. By releasing this dataset, we aim to empower the research community to advance speech decoding methodologies and accelerate the development of safe, effective clinical brain-computer interfaces.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv:2506.02098

Country: North America > United States (0.67)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Health Care Technology (1.00)
Health & Medicine > Diagnostic Medicine (0.93)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models

Lu, Ke-Han, Kuan, Chun-Yi, Lee, Hung-yi

arXiv.org Artificial Intelligence, May-27-2025

We introduce Speech-IFEval, an evaluation framework designed to assess instruction-following capabilities and quantify catastrophic forgetting in speech-aware language models (SLMs). Recent SLMs integrate speech perception with large language models (LLMs), often degrading textual capabilities due to speech-centric training. Existing benchmarks conflate speech perception with instruction-following, hindering evaluation of these distinct skills. To address this gap, we provide a benchmark for diagnosing the instruction-following abilities of SLMs. Our findings show that most SLMs struggle with even basic instructions, performing far worse than text-based LLMs. Additionally, these models are highly sensitive to prompt variations, often yielding inconsistent and unreliable outputs. We highlight core challenges and provide insights to guide future research, emphasizing the need for evaluation beyond task-level metrics.
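
The abstract does not state how forgetting is quantified. One plausible way, shown here purely as an assumption, is the relative change in an instruction-following score between a speech-aware model and its text-only backbone. The function name and the scores below are hypothetical and are not results from the paper.

```python
def relative_change(slm_score: float, llm_score: float) -> float:
    """Relative change of the speech-aware model versus its text-only backbone.

    A negative value means the speech-aware model scores lower than the
    backbone, which is read here as forgetting (assumed metric).
    """
    if llm_score == 0:
        raise ValueError("reference score must be non-zero")
    return (slm_score - llm_score) / llm_score


# Hypothetical instruction-following rates, chosen only to show the arithmetic.
backbone_score = 0.80
speech_model_score = 0.60

print(f"{relative_change(speech_model_score, backbone_score):+.1%}")
```

Normalizing by the backbone score makes drops comparable across models whose backbones start from different baselines.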

large language model, machine learning, natural language, (17 more...)

arXiv:2505.19037

Country: Asia > Thailand (0.15)

Genre: Research Report > New Finding (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.30)

Enhancing nonnative speech perception and production through an AI-powered application

Georgiou, Georgios P.

arXiv.org Artificial Intelligence, Mar-18-2025

While research on using Artificial Intelligence (AI) through various applications to enhance foreign language pronunciation is expanding, it has primarily focused on aspects such as comprehensibility and intelligibility, largely neglecting the improvement of individual speech sounds in both perception and production. This study seeks to address this gap by examining the impact of training with an AI-powered mobile application on nonnative sound perception and production. Participants completed a pretest assessing their ability to discriminate the second language English heed-hid contrast and produce these vowels in sentence contexts. The intervention involved training with the Speakometer mobile application, which incorporated recording tasks featuring the English vowels, along with pronunciation feedback and practice. The posttest mirrored the pretest to measure changes in performance. The results revealed significant improvements in both discrimination accuracy and production of the target contrast following the intervention. However, participants did not achieve native-like competence. These findings highlight the effectiveness of AI-powered applications in facilitating speech acquisition and support their potential use for personalized, interactive pronunciation training beyond the classroom.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv:2503.22705

Country:

Europe > Austria > Vienna (0.14)
Europe > Middle East > Cyprus > Nicosia > Nicosia (0.05)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Education > Educational Setting (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Speaker effects in spoken language comprehension

Wu, Hanlin, Cai, Zhenguang G.

arXiv.org Artificial Intelligence, Dec-10-2024

The identity of a speaker significantly influences spoken language comprehension by affecting both perception and expectation. This review explores speaker effects, focusing on how speaker information impacts language processing. We propose an integrative model featuring the interplay between bottom-up perception-based processes driven by acoustic details and top-down expectation-based processes driven by a speaker model. The acoustic details influence lower-level perception, while the speaker model modulates both lower-level and higher-level processes such as meaning interpretation and pragmatic inferences. We define speaker-idiosyncrasy and speaker-demographics effects and demonstrate how bottom-up and top-down processes interact at various levels in different scenarios. This framework contributes to psycholinguistic theory by offering a comprehensive account of how speaker information interacts with linguistic content to shape message construction. We suggest that speaker effects can serve as indices of a language learner's proficiency and an individual's characteristics of social cognition. We encourage future research to extend these findings to AI speakers, probing the universality of speaker effects across humans and artificial agents.

artificial intelligence, comprehension, natural language, (19 more...)

arXiv:2412.07238

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > Pennsylvania (0.04)
Asia > Indonesia (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report > New Finding (0.93)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.68)
Media (0.67)
Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)

Speech perception: a model of word recognition

Luck, Jean-Marc, Mehta, Anita

arXiv.org Artificial Intelligence, Oct-24-2024

We present a model of speech perception which takes into account effects of correlations between sounds. Words in this model correspond to the attractors of a suitably chosen descent dynamics. The resulting lexicon is rich in short words, and much less so in longer ones, as befits a reasonable word length distribution. We separately examine the decryption of short and long words in the presence of mishearings. In the regime of short words, the algorithm either quickly retrieves a word, or proposes another valid word. In the regime of longer words, the behaviour is markedly different. While the successful decryption of words continues to be relatively fast, there is a finite probability of getting lost permanently, as the algorithm wanders round the landscape of suitable words without ever settling on one.

artificial intelligence, attractor, speech perception, (15 more...)

arXiv:2410.1859

Country:

North America > United States > New York (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > Germany > Baden-Württemberg > Karlsruhe Region > Weinheim (0.04)
Europe > France > Occitanie > Haute-Garonne > Toulouse (0.04)

Genre: Research Report (0.63)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.46)

Transfer Learning from Whisper for Microscopic Intelligibility Prediction

Best, Paul, Cuervo, Santiago, Marxer, Ricard

arXiv.org Artificial Intelligence, Apr-2-2024

Macroscopic intelligibility models predict the expected human word-error-rate for a given speech-in-noise stimulus. In contrast, microscopic intelligibility models aim to make fine-grained predictions about listeners' perception, e.g. predicting phonetic or lexical responses. State-of-the-art macroscopic models use transfer learning from large scale deep learning models for speech processing, whereas such methods have rarely been used for microscopic modeling. In this paper, we study the use of transfer learning from Whisper, a state-of-the-art deep learning model for automatic speech recognition, for microscopic intelligibility prediction at the level of lexical responses. Our method outperforms the considered baselines, even in a zero-shot setup, and yields a relative improvement of up to 66% when fine-tuned to predict listeners' responses. Our results showcase the promise of large scale deep learning based methods for microscopic intelligibility prediction.

intelligibility prediction, listener, prediction, (14 more...)

arXiv:2404.01737

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Europe > Germany > Berlin (0.04)

Genre: Research Report > New Finding (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Towards Decoding Brain Activity During Passive Listening of Speech

Fodor, Milán András, Csapó, Tamás Gábor, Arthur, Frigyes Viktor

arXiv.org Artificial Intelligence, Feb-26-2024

The aim of the study is to investigate the complex mechanisms of speech perception and ultimately decode the electrical changes in the brain occurring while listening to speech. We attempt to decode heard speech from intracranial electroencephalographic (iEEG) data using deep learning methods. The goal is to aid the advancement of brain-computer interface (BCI) technology for speech synthesis, and, hopefully, to provide an additional perspective on the cognitive processes of speech perception. This approach diverges from the conventional focus on speech production and instead chooses to investigate neural representations of perceived speech. This angle opened up a complex perspective, potentially allowing us to study more sophisticated neural patterns. Leveraging the power of deep learning models, the research aimed to establish a connection between these intricate neural activities and the corresponding speech sounds. Despite the approach not having achieved a breakthrough yet, the research sheds light on the potential of decoding neural activity during speech perception. Our current efforts can serve as a foundation, and we are optimistic about the potential of expanding and improving upon this work to move closer towards more advanced BCIs, better understanding of processes underlying perceived speech and its relation to spoken speech.

neural activity, speech, speech perception, (15 more...)

arXiv:2402.16996

Country:

Europe > Italy (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Europe > Finland > Uusimaa > Helsinki (0.04)

Genre:

Research Report > New Finding (0.48)
Research Report > Experimental Study (0.48)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)

Do self-supervised speech and language models extract similar representations as human brain?

Chen, Peili, He, Linyang, Fu, Li, Fan, Lu, Chang, Edward F., Li, Yuanning

arXiv.org Artificial Intelligence, Jan-31-2024

Speech and language models trained through self-supervised learning (SSL) demonstrate strong alignment with brain activity during speech and language perception. However, given their distinct training modalities, it remains unclear whether they correlate with the same neural aspects. We directly address this question by evaluating the brain prediction performance of two representative SSL models, Wav2Vec2.0 and GPT-2, designed for speech and language tasks. Our findings reveal that both models accurately predict speech responses in the auditory cortex, with a significant correlation between their brain predictions. Notably, shared speech contextual information between Wav2Vec2.0 and GPT-2 accounts for the majority of explained variance in brain activity, surpassing static semantic and lower-level acoustic-phonetic information. These results underscore the convergence of speech contextual representations in SSL models and their alignment with the neural network underlying speech perception, offering valuable insights into both SSL models and the neural basis of speech and language processing.
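
The claim that shared information accounts for most of the explained variance can be read through standard two-predictor variance partitioning. The sketch below assumes that approach; the abstract does not say which method the authors used. It takes the explained variance (R²) of an encoding model fitted on each feature set alone and on both together, and splits the joint R² into unique and shared parts. The names and R² values are hypothetical illustrations, not results from the paper.

```python
from typing import NamedTuple


class Partition(NamedTuple):
    unique_a: float  # variance explained only by feature set A
    unique_b: float  # variance explained only by feature set B
    shared: float    # variance explained by either set


def partition_variance(r2_a: float, r2_b: float, r2_joint: float) -> Partition:
    """Split the joint R^2 of two feature sets into unique and shared parts."""
    return Partition(
        unique_a=r2_joint - r2_b,
        unique_b=r2_joint - r2_a,
        shared=r2_a + r2_b - r2_joint,
    )


# Made-up R^2 values for a speech model (A), a language model (B), and both.
parts = partition_variance(r2_a=0.30, r2_b=0.25, r2_joint=0.35)

for name, value in parts._asdict().items():
    print(f"{name}: {value:.2f}")
```

By construction the three parts add up to the joint R², so a large shared term means that most of what either model explains is also explained by the other.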

information, representation, wav2vec2, (13 more...)

arXiv:2310.04645

Country:

North America > United States > California > San Francisco County > San Francisco (0.28)
North America > United States > Michigan > Washtenaw County > Ann Arbor (0.14)
Asia > China > Shanghai > Shanghai (0.05)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
