Goto

Collaborating Authors

 speech perception


Aligning Brain Signals with Multimodal Speech and Vision Embeddings

arXiv.org Artificial Intelligence

When we hear the word "house", we don't just process sound, we imagine walls, doors, memories. The brain builds meaning through layers, moving from raw acoustics to rich, multimodal associations. Inspired by this, we build on recent work from Meta that aligned EEG signals with averaged wav2vec2 speech embeddings, and ask a deeper question: which layers of pre-trained models best reflect this layered processing in the brain? We compare embeddings from two models: wav2vec2, which encodes sound into language, and CLIP, which maps words to images. Using EEG recorded during natural speech perception, we evaluate how these embeddings align with brain activity using ridge regression and contrastive decoding. We test three strategies: individual layers, progressive concatenation, and progressive summation. The findings suggest that combining multimodal, layer-aware representations may bring us closer to decoding how the brain understands language, not just as sound, but as experience.


LibriBrain: Over 50 Hours of Within-Subject MEG to Improve Speech Decoding Methods at Scale

arXiv.org Artificial Intelligence

LibriBrain represents the largest single-subject MEG dataset to date for speech decoding, with over 50 hours of recordings -- 5$\times$ larger than the next comparable dataset and 50$\times$ larger than most. This unprecedented `depth' of within-subject data enables exploration of neural representations at a scale previously unavailable with non-invasive methods. LibriBrain comprises high-quality MEG recordings together with detailed annotations from a single participant listening to naturalistic spoken English, covering nearly the full Sherlock Holmes canon. Designed to support advances in neural decoding, LibriBrain comes with a Python library for streamlined integration with deep learning frameworks, standard data splits for reproducibility, and baseline results for three foundational decoding tasks: speech detection, phoneme classification, and word classification. Baseline experiments demonstrate that increasing training data yields substantial improvements in decoding performance, highlighting the value of scaling up deep, within-subject datasets. By releasing this dataset, we aim to empower the research community to advance speech decoding methodologies and accelerate the development of safe, effective clinical brain-computer interfaces.


Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models

arXiv.org Artificial Intelligence

We introduce Speech-IFeval, an evaluation framework designed to assess instruction-following capabilities and quantify catastrophic forgetting in speech-aware language models (SLMs). Recent SLMs integrate speech perception with large language models (LLMs), often degrading textual capabilities due to speech-centric training. Existing benchmarks conflate speech perception with instruction-following, hindering evaluation of these distinct skills. To address this gap, we provide a benchmark for diagnosing the instruction-following abilities of SLMs. Our findings show that most SLMs struggle with even basic instructions, performing far worse than text-based LLMs. Additionally, these models are highly sensitive to prompt variations, often yielding inconsistent and unreliable outputs. We highlight core challenges and provide insights to guide future research, emphasizing the need for evaluation beyond task-level metrics.


Enhancing nonnative speech perception and production through an AI-powered application

arXiv.org Artificial Intelligence

While research on using Artificial Intelligence (AI) through various applications to enhance foreign language pronunciation is expanding, it has primarily focused on aspects such as comprehensibility and intelligibility, largely neglecting the improvement of individual speech sounds in both perception and production. This study seeks to address this gap by examining the impact of training with an AI-powered mobile application on nonnative sound perception and production. Participants completed a pretest assessing their ability to discriminate the second language English heed-hid contrast and produce these vowels in sentence contexts. The intervention involved training with the Speakometer mobile application, which incorporated recording tasks featuring the English vowels, along with pronunciation feedback and practice. The posttest mirrored the pretest to measure changes in performance. The results revealed significant improvements in both discrimination accuracy and production of the target contrast following the intervention. However, participants did not achieve native-like competence. These findings highlight the effectiveness of AI-powered applications in facilitating speech acquisition and support their potential use for personalized, interactive pronunciation training beyond the classroom.


Speaker effects in spoken language comprehension

arXiv.org Artificial Intelligence

The identity of a speaker significantly influences spoken language comprehension by affecting both perception and expectation. This review explores speaker effects, focusing on how speaker information impacts language processing. We propose an integrative model featuring the interplay between bottom-up perception-based processes driven by acoustic details and top-down expectation-based processes driven by a speaker model. The acoustic details influence lower-level perception, while the speaker model modulates both lower-level and higher-level processes such as meaning interpretation and pragmatic inferences. We define speaker-idiosyncrasy and speaker-demographics effects and demonstrate how bottom-up and top-down processes interact at various levels in different scenarios. This framework contributes to psycholinguistic theory by offering a comprehensive account of how speaker information interacts with linguistic content to shape message construction. We suggest that speaker effects can serve as indices of a language learner's proficiency and an individual's characteristics of social cognition. We encourage future research to extend these findings to AI speakers, probing the universality of speaker effects across humans and artificial agents.


Speech perception: a model of word recognition

arXiv.org Artificial Intelligence

We present a model of speech perception which takes into account effects of correlations between sounds. Words in this model correspond to the attractors of a suitably chosen descent dynamics. The resulting lexicon is rich in short words, and much less so in longer ones, as befits a reasonable word length distribution. We separately examine the decryption of short and long words in the presence of mishearings. In the regime of short words, the algorithm either quickly retrieves a word, or proposes another valid word. In the regime of longer words, the behaviour is markedly different. While the successful decryption of words continues to be relatively fast, there is a finite probability of getting lost permanently, as the algorithm wanders round the landscape of suitable words without ever settling on one.


Transfer Learning from Whisper for Microscopic Intelligibility Prediction

arXiv.org Artificial Intelligence

Macroscopic intelligibility models predict the expected human word-error-rate for a given speech-in-noise stimulus. In contrast, microscopic intelligibility models aim to make fine-grained predictions about listeners' perception, e.g. predicting phonetic or lexical responses. State-of-the-art macroscopic models use transfer learning from large scale deep learning models for speech processing, whereas such methods have rarely been used for microscopic modeling. In this paper, we study the use of transfer learning from Whisper, a state-of-the-art deep learning model for automatic speech recognition, for microscopic intelligibility prediction at the level of lexical responses. Our method outperforms the considered baselines, even in a zero-shot setup, and yields a relative improvement of up to 66\% when fine-tuned to predict listeners' responses. Our results showcase the promise of large scale deep learning based methods for microscopic intelligibility prediction.


Towards Decoding Brain Activity During Passive Listening of Speech

arXiv.org Artificial Intelligence

The aim of the study is to investigate the complex mechanisms of speech perception and ultimately decode the electrical changes in the brain accruing while listening to speech. We attempt to decode heard speech from intracranial electroencephalographic (iEEG) data using deep learning methods. The goal is to aid the advancement of brain-computer interface (BCI) technology for speech synthesis, and, hopefully, to provide an additional perspective on the cognitive processes of speech perception. This approach diverges from the conventional focus on speech production and instead chooses to investigate neural representations of perceived speech. This angle opened up a complex perspective, potentially allowing us to study more sophisticated neural patterns. Leveraging the power of deep learning models, the research aimed to establish a connection between these intricate neural activities and the corresponding speech sounds. Despite the approach not having achieved a breakthrough yet, the research sheds light on the potential of decoding neural activity during speech perception. Our current efforts can serve as a foundation, and we are optimistic about the potential of expanding and improving upon this work to move closer towards more advanced BCIs, better understanding of processes underlying perceived speech and its relation to spoken speech.


Do self-supervised speech and language models extract similar representations as human brain?

arXiv.org Artificial Intelligence

Speech and language models trained through self-supervised learning (SSL) demonstrate strong alignment with brain activity during speech and language perception. However, given their distinct training modalities, it remains unclear whether they correlate with the same neural aspects. We directly address this question by evaluating the brain prediction performance of two representative SSL models, Wav2Vec2.0 and GPT-2, designed for speech and language tasks. Our findings reveal that both models accurately predict speech responses in the auditory cortex, with a significant correlation between their brain predictions. Notably, shared speech contextual information between Wav2Vec2.0 and GPT-2 accounts for the majority of explained variance in brain activity, surpassing static semantic and lower-level acoustic-phonetic information. These results underscore the convergence of speech contextual representations in SSL models and their alignment with the neural network underlying speech perception, offering valuable insights into both SSL models and the neural basis of speech and language processing.


Language Dynamics: Sound Categorization

AAAI Conferences

A form of categorical perception occurs constantly outside the laboratory, as when different The history of research on speech perception is speakers produce the "same" word or when a speaker says replete with examples of nonlinearities, or threshold the "same" word quickly or slowly. This means that phenomena, relating acoustics to perception. These speech perception cannot be a simple concatenation of nonlinearities are essential in that they allow stable sound elements to yield syllables, syllables to yield communication despite variation in the acoustic signal words, or words to yield sentences. The interdependency across speakers, emphasis, background noise, etc. across scales reveals a complex system with nonlinearly Furthermore, the range of acoustic signals perceived as interacting elements that somehow allow veridical equivalent is much larger for speech sounds than for communication.