
Collaborating Authors

wav2vec2




Articulation-Informed ASR: Integrating Articulatory Features into ASR via Auxiliary Speech Inversion and Cross-Attention Fusion

Attia, Ahmed Adel, Liu, Jing, Espy-Wilson, Carol

arXiv.org Artificial Intelligence

Prior works have investigated the use of articulatory features as complementary representations for automatic speech recognition (ASR), but their use was largely confined to shallow acoustic models. In this work, we revisit articulatory information in the era of deep learning and propose a framework that leverages articulatory representations both as an auxiliary task and as a pseudo-input to the recognition model. Specifically, we employ speech inversion as an auxiliary prediction task, and the predicted articulatory features are injected into the model as a query stream in a cross-attention module with acoustic embeddings as keys and values. Experiments on LibriSpeech demonstrate that our approach yields consistent improvements over strong transformer-based baselines, particularly under low-resource conditions. These findings suggest that articulatory features, once sidelined in ASR research, can provide meaningful benefits when reintroduced with modern architectures.
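The cross-attention fusion described in this abstract can be sketched in PyTorch. The dimensions below (12 articulatory tract variables, 768-dim acoustic embeddings, 4 heads) and the projection layout are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ArticulatoryCrossAttention(nn.Module):
    """Sketch: predicted articulatory features act as queries that
    attend over acoustic embeddings (keys/values), as described in
    the abstract. All sizes are illustrative."""

    def __init__(self, d_acoustic=768, d_artic=12, d_model=256, n_heads=4):
        super().__init__()
        self.q_proj = nn.Linear(d_artic, d_model)    # articulatory -> query
        self.kv_proj = nn.Linear(d_acoustic, d_model)  # acoustic -> key/value
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, artic, acoustic):
        q = self.q_proj(artic)        # (B, T_artic, d_model)
        kv = self.kv_proj(acoustic)   # (B, T_acoustic, d_model)
        fused, _ = self.attn(q, kv, kv)
        return fused                  # (B, T_artic, d_model)

# Toy inputs: batch of 2, 50 frames each.
fused = ArticulatoryCrossAttention()(torch.randn(2, 50, 12),
                                     torch.randn(2, 50, 768))
print(fused.shape)  # torch.Size([2, 50, 256])
```

The fused stream would then feed the downstream recognition layers alongside (or in place of) the acoustic embeddings.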




Benchmarking Akan ASR Models Across Domain-Specific Datasets: A Comparative Evaluation of Performance, Scalability, and Adaptability

Mensah, Mark Atta, Wiafe, Isaac, Ekpezu, Akon, Appati, Justice Kwame, Abdulai, Jamal-Deen, Wiafe-Akenten, Akosua Nyarkoa, Yeboah, Frank Ernest, Odame, Gifty

arXiv.org Artificial Intelligence

Most existing automatic speech recognition (ASR) research evaluates models on in-domain datasets and seldom examines how they generalize across diverse speech contexts. This study addresses this gap by benchmarking seven Akan ASR models built on transformer architectures, such as Whisper and Wav2Vec2, using four Akan speech corpora. These datasets encompass various domains, including culturally relevant image descriptions, informal conversations, biblical scripture readings, and spontaneous financial dialogues. A comparison of word error rate and character error rate highlighted domain dependency, with models performing optimally only within their training domains and showing marked accuracy degradation in mismatched scenarios. The study also identified distinct error behaviors between the Whisper and Wav2Vec2 architectures: fine-tuned Akan Whisper models produced more fluent but potentially misleading transcription errors, whereas Wav2Vec2 produced more obviously erroneous yet less interpretable outputs when encountering unfamiliar inputs. This trade-off between readability and transparency in ASR errors should be considered when selecting architectures for low-resource language (LRL) applications. These findings highlight the need for targeted domain adaptation techniques, adaptive routing strategies, and multilingual training frameworks for Akan and other LRLs.
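The word and character error rates used for this comparison reduce to a Levenshtein edit distance normalized by reference length. A minimal stdlib sketch (the sample strings are invented for illustration, not from the Akan corpora):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (words or chars)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    """Word error rate: word-level edits / reference word count."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    """Character error rate: char-level edits / reference char count."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

print(wer("me da wo ase", "me da wase"))  # 0.5 (1 substitution + 1 deletion over 4 words)
```

In practice, libraries such as jiwer package the same computation with text normalization options.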


Teaching Wav2Vec2 the Language of the Brain

Fiedler, Tobias, Hermann, Leon, Müller, Florian, Cohen, Sarel, Chin, Peter, Friedrich, Tobias, Vaadia, Eilon

arXiv.org Artificial Intelligence

The decoding of continuously spoken speech from neuronal activity has the potential to become an important clinical solution for paralyzed patients. Deep Learning Brain Computer Interfaces (BCIs) have recently successfully mapped neuronal activity to text in subjects who attempted to formulate speech. However, only small BCI datasets are available. In contrast, labeled data and pre-trained models for the closely related task of speech recognition from audio are widely available. One such model is Wav2Vec2, which has been trained in a self-supervised fashion to create meaningful representations of speech audio data. In this study, we show that patterns learned by Wav2Vec2 are transferable to brain data. Specifically, we replace its audio feature extractor with an untrained Brain Feature Extractor (BFE) model. We then run three training regimes for each of 45 different BFE architectures: full fine-tuning with pre-trained Wav2Vec2 weights, training from scratch without pre-trained weights, and training only the BFE with a frozen pre-trained Wav2Vec2. Across these experiments, the best run comes from full fine-tuning with pre-trained weights, achieving a Character Error Rate (CER) of 18.54%, outperforming the best from-scratch run by 20.46 percentage points and the best frozen-Wav2Vec2 run by 15.92 percentage points. These results indicate that knowledge transfer from audio speech recognition to brain decoding is possible and significantly improves brain decoding performance for the same architectures. Related source code is available at https://github.com/tfiedlerdev/Wav2Vec2ForBrain.
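The feature-extractor swap and the frozen-encoder training regime can be sketched in PyTorch. The BFE architecture, channel counts, and the stand-in encoder below are illustrative assumptions; the paper evaluates 45 BFE variants against a pretrained Wav2Vec2:

```python
import torch
import torch.nn as nn

class BrainFeatureExtractor(nn.Module):
    """Hypothetical BFE: maps multichannel neural recordings to the
    embedding size the downstream (pretrained) encoder expects."""

    def __init__(self, n_channels=128, d_model=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_channels, 256, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
            nn.Conv1d(256, d_model, kernel_size=5, stride=2, padding=2),
        )

    def forward(self, x):                    # x: (B, n_channels, T)
        return self.net(x).transpose(1, 2)   # (B, T', d_model)

# Stand-in for the pretrained Wav2Vec2 transformer encoder.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True),
    num_layers=2,
)
bfe = BrainFeatureExtractor()

# "Frozen Wav2Vec2" regime: only the BFE receives gradient updates.
for p in encoder.parameters():
    p.requires_grad = False

feats = bfe(torch.randn(2, 128, 200))  # two downsampling convs: 200 -> 50 frames
out = encoder(feats)
print(out.shape)  # torch.Size([2, 50, 768])
```

With a real checkpoint, the same idea applies: load the pretrained encoder, drop its convolutional audio frontend, and plug the BFE in its place.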


Speaker Emotion Recognition: Leveraging Self-Supervised Models for Feature Extraction Using Wav2Vec2 and HuBERT

Jafarzadeh, Pourya, Rostami, Amir Mohammad, Choobdar, Padideh

arXiv.org Artificial Intelligence

Speech is the most natural way of expressing ourselves as humans. Identifying emotion from speech is a nontrivial task due to the ambiguous definition of emotion itself. Speaker Emotion Recognition (SER) is essential for understanding human emotional behavior. The SER task is challenging due to the variety of speakers, background noise, complexity of emotions, and speaking styles. It has many applications in education, healthcare, customer service, and Human-Computer Interaction (HCI). Previously, conventional machine learning methods such as SVM, HMM, and KNN have been used for the SER task. In recent years, deep learning methods have become popular, with convolutional neural networks and recurrent neural networks being used for SER tasks. The input of these methods is mostly spectrograms and hand-crafted features. In this work, we study the use of self-supervised transformer-based models, Wav2Vec2 and HuBERT, to determine the emotion of speakers from their voice. The models automatically extract features from raw audio signals, which are then used for the classification task. The proposed solution is evaluated on reputable datasets, including RAVDESS, SHEMO, SAVEE, AESDD, and Emo-DB. The results show the effectiveness of the proposed method on different datasets. Moreover, the model has been used for real-world applications like call center conversations, and the results demonstrate that the model accurately predicts emotions.
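The classification pipeline this abstract describes, frame-level SSL embeddings pooled into an utterance vector and passed to a classifier, can be sketched as follows. The label set, the mean-pooling choice, and the head architecture are illustrative assumptions; 768 matches the hidden size of the base Wav2Vec2/HuBERT models:

```python
import torch
import torch.nn as nn

EMOTIONS = ["neutral", "happy", "sad", "angry"]  # illustrative label set

class EmotionHead(nn.Module):
    """Mean-pool frame-level SSL embeddings (e.g. Wav2Vec2 or HuBERT
    hidden states), then classify the utterance's emotion."""

    def __init__(self, d_model=768, n_classes=len(EMOTIONS)):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(d_model, 256),
            nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, frame_embeddings):        # (B, T, d_model)
        pooled = frame_embeddings.mean(dim=1)   # one vector per utterance
        return self.classifier(pooled)          # (B, n_classes) logits

# Toy batch: 2 utterances, 120 frames of 768-dim embeddings each.
logits = EmotionHead()(torch.randn(2, 120, 768))
print(logits.shape)  # torch.Size([2, 4])
```

In a full system, the frame embeddings would come from a pretrained SSL model run on raw audio, with the head trained on labeled emotion data.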


Improving semantic understanding in speech language models via brain-tuning

Moussa, Omer, Klakow, Dietrich, Toneva, Mariya

arXiv.org Artificial Intelligence

Speech language models align with human brain responses to natural language to an impressive degree. However, current models rely heavily on low-level speech features, indicating they lack brain-relevant semantics, which limits their utility as model organisms of semantic processing in the brain. In this work, we address this limitation by inducing brain-relevant bias directly into the models via fine-tuning with fMRI recordings of people listening to natural stories, a process we name brain-tuning. After testing it on 3 different pretrained model families, we show that brain-tuning not only improves overall alignment with new brain recordings in semantic language regions, but also reduces the reliance on low-level speech features for this alignment. Excitingly, we further show that brain-tuning leads to 1) consistent improvements in performance on a range of downstream tasks and 2) a representational space with increased semantic preference. Our results provide converging evidence, for the first time, that incorporating brain signals into the training of language models improves the models' semantic understanding. It is an exciting time for the cognitive neuroscience of language with the rise of language models, which have been shown to align with human brain responses. Researchers aim to use language models as model organisms of reading and listening in the brain (Toneva, 2021) to learn more about the underlying information processing that leads to brain-like representations of language. However, recent work has questioned whether current popular speech language models can serve this role fully, as their alignment with semantic brain regions was shown to be mostly due to low-level speech features, indicating that speech language models lack brain-relevant semantics (Oota et al., 2024a).
Given that most large public brain recordings datasets are of speech-evoked language (LeBel et al., 2023; Nastase et al., 2021; Deniz et al., 2019; Momenian et al., 2024), having access to speech models with improved brain-relevant semantics is important and will provide better model organisms for auditory language processing. The lack of brain-relevant semantics in speech models (Oota et al., 2024a) may also be related to their incomplete semantic understanding for downstream language tasks (Choi et al., 2024). To bridge the gap between language understanding in speech models and the human brain, we propose to augment pretrained speech model training directly with brain recordings in a process we call brain-tuning (see Figure 1a for illustration of the training approach).
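One way to picture the brain-tuning objective is a head that maps the model's representations to fMRI voxel responses and penalizes the prediction error, letting gradients flow back into the speech model. The voxel count, pooling scheme, and plain MSE loss below are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn as nn

class BrainTuningHead(nn.Module):
    """Sketch of the brain-tuning idea: predict fMRI voxel responses
    from model representations and minimise the prediction error.
    Dimensions and pooling are illustrative."""

    def __init__(self, d_model=768, n_voxels=1000):
        super().__init__()
        self.to_voxels = nn.Linear(d_model, n_voxels)

    def forward(self, reps, fmri):   # reps: (B, T, d_model), fmri: (B, n_voxels)
        pooled = reps.mean(dim=1)    # one vector per fMRI acquisition window
        pred = self.to_voxels(pooled)
        return nn.functional.mse_loss(pred, fmri)

head = BrainTuningHead()
# Toy batch: 4 windows of 16 frames each, paired with 1000-voxel responses.
loss = head(torch.randn(4, 16, 768), torch.randn(4, 1000))
loss.backward()  # gradients reach the head (and, in training, the speech model)
print(float(loss) > 0)
```

During actual brain-tuning, this brain-prediction loss would be optimized alongside (or in place of) the model's original pretraining objective on story-listening data.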


Exploring ASR-Based Wav2Vec2 for Automated Speech Disorder Assessment: Insights and Analysis

Nguyen, Tuan, Fredouille, Corinne, Ghio, Alain, Balaguer, Mathieu, Woisard, Virginie

arXiv.org Artificial Intelligence

Some automatic systems have shown robust performance and stability by learning from expert decisions [6, 7]. In 2024, Nguyen et al. [8] introduced a system that leverages the Automatic Speech Recognition (ASR)-based Wav2Vec2 model [9], known for its strong capability in learning speech representations. This approach compared self-supervised learning (SSL) and the ASR dimension for speech quality assessment. It is shown that the ASR-based model, fine-tuned for automated speech disorder quality assessment tasks, yields impressive results and sets a new baseline for Head and Neck Cancer speech contexts. This demonstrates that the ASR dimension from Wav2Vec2 closely aligns with assessment dimensions. Despite its effectiveness, this system remains a black box, with no clear interpretation of the connection between the model's ASR dimension and clinical assessments. This paper presents the first analysis of this baseline model for speech quality assessment, focusing on intelligibility and severity tasks.
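A basic building block for the kind of analysis this paper describes, relating a model dimension to clinical ratings, is a correlation between pooled model activations and clinician scores. A stdlib sketch with invented toy values (not data from the study):

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation, e.g. between a pooled model-layer activation
    and clinician-rated severity scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Toy example: one activation value and one severity rating per speaker.
activations = [0.2, 0.5, 0.1, 0.9, 0.7]
severity = [1.0, 2.0, 1.0, 4.0, 3.0]
print(round(pearson_r(activations, severity), 3))
```

Repeating such probes across layers and dimensions is one route to opening the "black box" connection between the ASR representation and intelligibility or severity assessments.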