
Collaborating Authors

wav2vec2




Articulation-Informed ASR: Integrating Articulatory Features into ASR via Auxiliary Speech Inversion and Cross-Attention Fusion

Attia, Ahmed Adel, Liu, Jing, Espy-Wilson, Carol

arXiv.org Artificial Intelligence

Prior works have investigated the use of articulatory features as complementary representations for automatic speech recognition (ASR), but their use was largely confined to shallow acoustic models. In this work, we revisit articulatory information in the era of deep learning and propose a framework that leverages articulatory representations both as an auxiliary task and as a pseudo-input to the recognition model. Specifically, we employ speech inversion as an auxiliary prediction task, and the predicted articulatory features are injected into the model as a query stream in a cross-attention module with acoustic embeddings as keys and values. Experiments on LibriSpeech demonstrate that our approach yields consistent improvements over strong transformer-based baselines, particularly under low-resource conditions. These findings suggest that articulatory features, once sidelined in ASR research, can provide meaningful benefits when reintroduced with modern architectures.
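The cross-attention fusion described in this abstract can be sketched in PyTorch. The dimensions below (12 articulatory tract variables, 768-dim acoustic embeddings, 4 heads) and the projection layout are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ArticulatoryCrossAttention(nn.Module):
    """Sketch: predicted articulatory features act as queries that
    attend over acoustic embeddings (keys/values), as described in
    the abstract. All sizes are illustrative."""

    def __init__(self, d_acoustic=768, d_artic=12, d_model=256, n_heads=4):
        super().__init__()
        self.q_proj = nn.Linear(d_artic, d_model)    # articulatory -> query
        self.kv_proj = nn.Linear(d_acoustic, d_model)  # acoustic -> key/value
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, artic, acoustic):
        q = self.q_proj(artic)        # (B, T_artic, d_model)
        kv = self.kv_proj(acoustic)   # (B, T_acoustic, d_model)
        fused, _ = self.attn(q, kv, kv)
        return fused                  # (B, T_artic, d_model)

# Toy inputs: batch of 2, 50 frames each.
fused = ArticulatoryCrossAttention()(torch.randn(2, 50, 12),
                                     torch.randn(2, 50, 768))
print(fused.shape)  # torch.Size([2, 50, 256])
```

The fused stream would then feed the downstream recognition layers alongside (or in place of) the acoustic embeddings.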




Benchmarking Akan ASR Models Across Domain-Specific Datasets: A Comparative Evaluation of Performance, Scalability, and Adaptability

Mensah, Mark Atta, Wiafe, Isaac, Ekpezu, Akon, Appati, Justice Kwame, Abdulai, Jamal-Deen, Wiafe-Akenten, Akosua Nyarkoa, Yeboah, Frank Ernest, Odame, Gifty

arXiv.org Artificial Intelligence

Most existing automatic speech recognition (ASR) research evaluates models on in-domain datasets and seldom examines how they generalize across diverse speech contexts. This study addresses this gap by benchmarking seven Akan ASR models built on transformer architectures, such as Whisper and Wav2Vec2, using four Akan speech corpora. These datasets encompass various domains, including culturally relevant image descriptions, informal conversations, biblical scripture readings, and spontaneous financial dialogues. A comparison of word error rate and character error rate highlighted domain dependency, with models performing optimally only within their training domains and showing marked accuracy degradation in mismatched scenarios. The study also identified distinct error behaviors between the Whisper and Wav2Vec2 architectures: fine-tuned Akan Whisper models produced more fluent but potentially misleading transcription errors, whereas Wav2Vec2 produced more obviously erroneous yet less interpretable outputs when encountering unfamiliar inputs. This trade-off between readability and transparency in ASR errors should be considered when selecting architectures for low-resource language (LRL) applications. These findings highlight the need for targeted domain adaptation techniques, adaptive routing strategies, and multilingual training frameworks for Akan and other LRLs.
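The word and character error rates used for this comparison reduce to a Levenshtein edit distance normalized by reference length. A minimal stdlib sketch (the sample strings are invented for illustration, not from the Akan corpora):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (words or chars)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    """Word error rate: word-level edits / reference word count."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    """Character error rate: char-level edits / reference char count."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

print(wer("me da wo ase", "me da wase"))  # 0.5 (1 substitution + 1 deletion over 4 words)
```

In practice, libraries such as jiwer package the same computation with text normalization options.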


Teaching Wav2Vec2 the Language of the Brain

Fiedler, Tobias, Hermann, Leon, Müller, Florian, Cohen, Sarel, Chin, Peter, Friedrich, Tobias, Vaadia, Eilon

arXiv.org Artificial Intelligence

The decoding of continuously spoken speech from neuronal activity has the potential to become an important clinical solution for paralyzed patients. Deep Learning Brain Computer Interfaces (BCIs) have recently successfully mapped neuronal activity to text in subjects who attempted to formulate speech. However, only small BCI datasets are available. In contrast, labeled data and pre-trained models for the closely related task of speech recognition from audio are widely available. One such model is Wav2Vec2, which has been trained in a self-supervised fashion to create meaningful representations of speech audio data. In this study, we show that patterns learned by Wav2Vec2 are transferable to brain data. Specifically, we replace its audio feature extractor with an untrained Brain Feature Extractor (BFE) model. We then run three training regimes for each of 45 different BFE architectures: full fine-tuning with pre-trained Wav2Vec2 weights, training from scratch without pre-trained weights, and training only the BFE with a frozen pre-trained Wav2Vec2. Across these experiments, the best run comes from full fine-tuning with pre-trained weights, achieving a Character Error Rate (CER) of 18.54%, outperforming the best from-scratch run by 20.46 percentage points and the best frozen-Wav2Vec2 run by 15.92 percentage points. These results indicate that knowledge transfer from audio speech recognition to brain decoding is possible and significantly improves brain decoding performance for the same architectures. Related source code is available at https://github.com/tfiedlerdev/Wav2Vec2ForBrain.
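The feature-extractor swap and the frozen-encoder training regime can be sketched in PyTorch. The BFE architecture, channel counts, and the stand-in encoder below are illustrative assumptions; the paper evaluates 45 BFE variants against a pretrained Wav2Vec2:

```python
import torch
import torch.nn as nn

class BrainFeatureExtractor(nn.Module):
    """Hypothetical BFE: maps multichannel neural recordings to the
    embedding size the downstream (pretrained) encoder expects."""

    def __init__(self, n_channels=128, d_model=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_channels, 256, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
            nn.Conv1d(256, d_model, kernel_size=5, stride=2, padding=2),
        )

    def forward(self, x):                    # x: (B, n_channels, T)
        return self.net(x).transpose(1, 2)   # (B, T', d_model)

# Stand-in for the pretrained Wav2Vec2 transformer encoder.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True),
    num_layers=2,
)
bfe = BrainFeatureExtractor()

# "Frozen Wav2Vec2" regime: only the BFE receives gradient updates.
for p in encoder.parameters():
    p.requires_grad = False

feats = bfe(torch.randn(2, 128, 200))  # two downsampling convs: 200 -> 50 frames
out = encoder(feats)
print(out.shape)  # torch.Size([2, 50, 768])
```

With a real checkpoint, the same idea applies: load the pretrained encoder, drop its convolutional audio frontend, and plug the BFE in its place.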


Speaker Emotion Recognition: Leveraging Self-Supervised Models for Feature Extraction Using Wav2Vec2 and HuBERT

Jafarzadeh, Pourya, Rostami, Amir Mohammad, Choobdar, Padideh

arXiv.org Artificial Intelligence

Speech is the most natural way of expressing ourselves as humans. Identifying emotion from speech is a nontrivial task due to the ambiguous definition of emotion itself. Speaker Emotion Recognition (SER) is essential for understanding human emotional behavior. The SER task is challenging due to the variety of speakers, background noise, complexity of emotions, and speaking styles. It has many applications in education, healthcare, customer service, and Human-Computer Interaction (HCI). Previously, conventional machine learning methods such as SVM, HMM, and KNN have been used for the SER task. In recent years, deep learning methods have become popular, with convolutional neural networks and recurrent neural networks being used for SER tasks. The input of these methods is mostly spectrograms and hand-crafted features. In this work, we study the use of self-supervised transformer-based models, Wav2Vec2 and HuBERT, to determine the emotion of speakers from their voice. The models automatically extract features from raw audio signals, which are then used for the classification task. The proposed solution is evaluated on reputable datasets, including RAVDESS, SHEMO, SAVEE, AESDD, and Emo-DB. The results show the effectiveness of the proposed method on different datasets. Moreover, the model has been used for real-world applications like call center conversations, and the results demonstrate that the model accurately predicts emotions.
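The classification pipeline this abstract describes, frame-level SSL embeddings pooled into an utterance vector and passed to a classifier, can be sketched as follows. The label set, the mean-pooling choice, and the head architecture are illustrative assumptions; 768 matches the hidden size of the base Wav2Vec2/HuBERT models:

```python
import torch
import torch.nn as nn

EMOTIONS = ["neutral", "happy", "sad", "angry"]  # illustrative label set

class EmotionHead(nn.Module):
    """Mean-pool frame-level SSL embeddings (e.g. Wav2Vec2 or HuBERT
    hidden states), then classify the utterance's emotion."""

    def __init__(self, d_model=768, n_classes=len(EMOTIONS)):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(d_model, 256),
            nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, frame_embeddings):        # (B, T, d_model)
        pooled = frame_embeddings.mean(dim=1)   # one vector per utterance
        return self.classifier(pooled)          # (B, n_classes) logits

# Toy batch: 2 utterances, 120 frames of 768-dim embeddings each.
logits = EmotionHead()(torch.randn(2, 120, 768))
print(logits.shape)  # torch.Size([2, 4])
```

In a full system, the frame embeddings would come from a pretrained SSL model run on raw audio, with the head trained on labeled emotion data.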


Improving semantic understanding in speech language models via brain-tuning

Moussa, Omer, Klakow, Dietrich, Toneva, Mariya

arXiv.org Artificial Intelligence

Speech language models align with human brain responses to natural language to an impressive degree. However, current models rely heavily on low-level speech features, indicating they lack brain-relevant semantics, which limits their utility as model organisms of semantic processing in the brain. In this work, we address this limitation by inducing brain-relevant bias directly into the models via fine-tuning with fMRI recordings of people listening to natural stories, a process we name brain-tuning. After testing it on 3 different pretrained model families, we show that brain-tuning not only improves overall alignment with new brain recordings in semantic language regions, but also reduces the reliance on low-level speech features for this alignment. Excitingly, we further show that brain-tuning leads to 1) consistent improvements in performance on a range of downstream tasks and 2) a representational space with increased semantic preference. Our results provide converging evidence, for the first time, that incorporating brain signals into the training of language models improves the models' semantic understanding. It is an exciting time for the cognitive neuroscience of language with the rise of language models, which have been shown to align with human brain responses. Researchers aim to use language models as model organisms of reading and listening in the brain (Toneva, 2021) to learn more about the underlying information processing that leads to brain-like representations of language. However, recent work has questioned whether current popular speech language models can serve this role fully, as their alignment with semantic brain regions was shown to be mostly due to low-level speech features, indicating that speech language models lack brain-relevant semantics (Oota et al., 2024a).
Given that most large public brain recordings datasets are of speech-evoked language (LeBel et al., 2023; Nastase et al., 2021; Deniz et al., 2019; Momenian et al., 2024), having access to speech models with improved brain-relevant semantics is important and will provide better model organisms for auditory language processing. The lack of brain-relevant semantics in speech models (Oota et al., 2024a) may also be related to their incomplete semantic understanding for downstream language tasks (Choi et al., 2024). To bridge the gap between language understanding in speech models and the human brain, we propose to augment pretrained speech model training directly with brain recordings in a process we call brain-tuning (see Figure 1a for illustration of the training approach).
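One way to picture the brain-tuning objective is a head that maps the model's representations to fMRI voxel responses and penalizes the prediction error, letting gradients flow back into the speech model. The voxel count, pooling scheme, and plain MSE loss below are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn as nn

class BrainTuningHead(nn.Module):
    """Sketch of the brain-tuning idea: predict fMRI voxel responses
    from model representations and minimise the prediction error.
    Dimensions and pooling are illustrative."""

    def __init__(self, d_model=768, n_voxels=1000):
        super().__init__()
        self.to_voxels = nn.Linear(d_model, n_voxels)

    def forward(self, reps, fmri):   # reps: (B, T, d_model), fmri: (B, n_voxels)
        pooled = reps.mean(dim=1)    # one vector per fMRI acquisition window
        pred = self.to_voxels(pooled)
        return nn.functional.mse_loss(pred, fmri)

head = BrainTuningHead()
# Toy batch: 4 windows of 16 frames each, paired with 1000-voxel responses.
loss = head(torch.randn(4, 16, 768), torch.randn(4, 1000))
loss.backward()  # gradients reach the head (and, in training, the speech model)
print(float(loss) > 0)
```

During actual brain-tuning, this brain-prediction loss would be optimized alongside (or in place of) the model's original pretraining objective on story-listening data.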


Exploring ASR-Based Wav2Vec2 for Automated Speech Disorder Assessment: Insights and Analysis

Nguyen, Tuan, Fredouille, Corinne, Ghio, Alain, Balaguer, Mathieu, Woisard, Virginie

arXiv.org Artificial Intelligence

Some automatic systems have shown robust performance and stability by learning from expert decisions [6, 7]. In 2024, Nguyen et al. [8] introduced a system that leverages the Automatic Speech Recognition (ASR)-based Wav2Vec2 model [9], known for its strong capability in learning speech representations. This approach compared self-supervised learning (SSL) and the ASR dimension for speech quality assessment. It is shown that the ASR-based model, fine-tuned for automated speech disorder quality assessment tasks, yields impressive results and sets a new baseline for Head and Neck Cancer speech contexts. This demonstrates that the ASR dimension from Wav2Vec2 closely aligns with assessment dimensions. Despite its effectiveness, this system remains a black box, with no clear interpretation of the connection between the model's ASR dimension and clinical assessments. This paper presents the first analysis of this baseline model for speech quality assessment, focusing on intelligibility and severity tasks.
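A basic building block for the kind of analysis this paper describes, relating a model dimension to clinical ratings, is a correlation between pooled model activations and clinician scores. A stdlib sketch with invented toy values (not data from the study):

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation, e.g. between a pooled model-layer activation
    and clinician-rated severity scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Toy example: one activation value and one severity rating per speaker.
activations = [0.2, 0.5, 0.1, 0.9, 0.7]
severity = [1.0, 2.0, 1.0, 4.0, 3.0]
print(round(pearson_r(activations, severity), 3))
```

Repeating such probes across layers and dimensions is one route to opening the "black box" connection between the ASR representation and intelligibility or severity assessments.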