wav2vec 2



Efficient ASR for Low-Resource Languages: Leveraging Cross-Lingual Unlabeled Data

arXiv.org Artificial Intelligence

Automatic speech recognition for low-resource languages remains fundamentally constrained by the scarcity of labeled data and computational resources required by state-of-the-art models. We present a systematic investigation into cross-lingual continual pretraining for low-resource languages, using Perso-Arabic languages (Persian, Arabic, and Urdu) as our primary case study. Our approach demonstrates that strategic utilization of unlabeled speech data can effectively bridge the resource gap without sacrificing recognition accuracy. We construct a 3,000-hour multilingual corpus through a scalable unlabeled data collection pipeline and employ targeted continual pretraining combined with morphologically aware tokenization to develop a 300M parameter model that achieves performance comparable to systems five times larger. Our model outperforms Whisper Large v3 (1.5B parameters) on Persian and achieves competitive results on Arabic and Urdu despite using significantly fewer parameters and substantially less labeled data. These findings challenge the prevailing assumption that ASR quality scales primarily with model size, revealing instead that data relevance and strategic pretraining are more critical factors for low-resource scenarios. This work provides a practical pathway toward inclusive speech technology, enabling effective ASR for underrepresented languages without dependence on massive computational infrastructure or proprietary datasets.


Dynamic Stress Detection: A Study of Temporal Progression Modelling of Stress in Speech

arXiv.org Artificial Intelligence

Detecting psychological stress from speech is critical in high-pressure settings. While prior work has leveraged acoustic features for stress detection, most treat stress as a static label. In this work, we model stress as a temporally evolving phenomenon influenced by historical emotional state. We propose a dynamic labelling strategy that derives fine-grained stress annotations from emotional labels and introduce cross-attention-based sequential models (a unidirectional LSTM and a Transformer encoder) to capture temporal stress progression. Our approach achieves notable accuracy gains on MuSE (+5%) and StressID (+18%) over existing baselines, and generalises well to a custom real-world dataset. These results highlight the value of modelling stress as a dynamic construct in speech.


Sounding Like a Winner? Prosodic Differences in Post-Match Interviews

arXiv.org Artificial Intelligence

This study examines the prosodic characteristics associated with winning and losing in post-match tennis interviews. Additionally, this research explores the potential to classify match outcomes solely based on post-match interview recordings using prosodic features and self-supervised learning (SSL) representations. By analyzing prosodic elements such as pitch and intensity, alongside SSL models like Wav2Vec 2.0 and HuBERT, the aim is to determine whether an athlete has won or lost their match. Traditional acoustic features and deep speech representations are extracted from the data, and machine learning classifiers are employed to distinguish between winning and losing players. Results indicate that SSL representations effectively differentiate between winning and losing outcomes, capturing subtle speech patterns linked to emotional states. At the same time, prosodic cues such as pitch variability remain strong indicators of victory.


Semantic Matters: Multimodal Features for Affective Analysis

arXiv.org Artificial Intelligence

In this study, we present our methodology for two tasks: the Emotional Mimicry Intensity (EMI) Estimation Challenge and the Behavioural Ambivalence/Hesitancy (BAH) Recognition Challenge, both conducted as part of the 8th Workshop and Competition on Affective & Behavior Analysis in-the-wild. We utilize a Wav2Vec 2.0 model pre-trained on a large podcast dataset to extract various audio features, capturing both linguistic and paralinguistic information. Our approach incorporates a valence-arousal-dominance (VAD) module derived from Wav2Vec 2.0, a BERT text encoder, and a vision transformer (ViT), with predictions subsequently processed through a long short-term memory (LSTM) architecture or a convolution-like method for temporal modeling. We integrate the textual and visual modalities into our analysis, recognizing that semantic content provides valuable contextual cues and underscoring that the meaning of speech often conveys more critical insights than its acoustic counterpart alone. Fusing in the vision modality helps in some cases to interpret the textual modality more precisely. This combined approach results in significant performance improvements, achieving $\rho_{\text{TEST}} = 0.706$ in EMI and $F1_{\text{TEST}} = 0.702$ in BAH, securing first place in the EMI challenge and second place in the BAH challenge.
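For readers unfamiliar with the two reported scores, the following minimal sketch shows how such metrics are commonly computed. It assumes that the EMI score is a Pearson correlation coefficient and that the BAH score is a standard F1 score; the abstract does not specify averaging schemes, so the function names `pearson` and `f1_binary` and the binary setting are illustrative assumptions, not the challenge's official evaluation code.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def f1_binary(y_true, y_pred, positive=1):
    """F1 score for one positive class (assumed binary setting)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

If a challenge reports one score over several output dimensions or classes, the per-dimension values from functions like these would typically be averaged.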


Keyword-Oriented Multimodal Modeling for Euphemism Identification

arXiv.org Artificial Intelligence

Euphemism identification deciphers the true meaning of euphemisms, such as linking "weed" (euphemism) to "marijuana" (target keyword) in illicit texts, aiding content moderation and combating underground markets. While existing methods are primarily text-based, the rise of social media highlights the need for multimodal analysis, incorporating text, images, and audio. However, the lack of multimodal datasets for euphemisms limits further research. To address this, we regard euphemisms and their corresponding target keywords as keywords and first introduce a keyword-oriented multimodal corpus of euphemisms (KOM-Euph), involving three datasets (Drug, Weapon, and Sexuality), including text, images, and speech. We further propose a keyword-oriented multimodal euphemism identification method (KOM-EI), which uses cross-modal feature alignment and dynamic fusion modules to explicitly utilize the visual and audio features of the keywords for efficient euphemism identification. Extensive experiments demonstrate that KOM-EI outperforms state-of-the-art models and large language models, and show the importance of our multimodal datasets.


Efficient Finetuning for Dimensional Speech Emotion Recognition in the Age of Transformers

arXiv.org Artificial Intelligence

Accurate speech emotion recognition is essential for developing human-facing systems. Recent advancements have included finetuning large, pretrained transformer models like Wav2Vec 2.0. However, the finetuning process requires substantial computational resources, including high-memory GPUs and significant processing time. As the demand for accurate emotion recognition continues to grow, efficient finetuning approaches are needed to reduce the computational burden. Our study focuses on dimensional emotion recognition, predicting attributes such as activation (calm to excited) and valence (negative to positive). We present various finetuning techniques, including full finetuning, partial finetuning of transformer layers, finetuning with mixed precision, partial finetuning with caching, and low-rank adaptation (LoRA) on the Wav2Vec 2.0 base model. We find that partial finetuning with mixed precision achieves performance comparable to full finetuning while increasing training speed by 67%. Caching intermediate representations further boosts efficiency, yielding an 88% speedup and a 71% reduction in learnable parameters. We recommend finetuning the final three transformer layers in mixed precision to balance performance and training efficiency, and adding intermediate representation caching for optimal speed with minimal performance trade-offs. These findings lower the barriers to finetuning speech emotion recognition systems, making accurate emotion recognition more accessible to a broader range of researchers and practitioners.
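The recommendation to finetune only the final three transformer layers can be sketched as a parameter-selection step. The helper below is a hypothetical illustration, not the authors' code: it assumes parameter names follow a pattern such as `wav2vec2.encoder.layers.11.attention.q_proj.weight`, that the base model has 12 transformer layers, and that the task head is named with the prefix `classifier.`. All three should be checked against the actual model before use.

```python
import re

# Assumed naming pattern for transformer blocks, e.g.
# "wav2vec2.encoder.layers.11.attention.q_proj.weight".
LAYER_PATTERN = re.compile(r"encoder\.layers\.(\d+)\.")


def trainable_names(param_names, num_layers=12, last_n=3,
                    head_prefixes=("classifier.",)):
    """Return the names of parameters that should stay trainable.

    Keeps the last `last_n` transformer layers and any task head whose
    name starts with one of `head_prefixes` (hypothetical prefix).
    Everything else is meant to be frozen.
    """
    first_trainable = num_layers - last_n
    keep = []
    for name in param_names:
        match = LAYER_PATTERN.search(name)
        in_last_layers = bool(match) and int(match.group(1)) >= first_trainable
        is_head = name.startswith(tuple(head_prefixes))
        if in_last_layers or is_head:
            keep.append(name)
    return keep
```

In PyTorch, one way to apply the result is to loop over `model.named_parameters()` and set `param.requires_grad` to `True` only for names in the returned list. Mixed precision would be enabled separately, for example with `torch.autocast` around the forward pass.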


Quantifying the Role of Textual Predictability in Automatic Speech Recognition

arXiv.org Artificial Intelligence

A long-standing question in automatic speech recognition research is how to attribute errors to a model's ability to model the acoustics, versus its ability to leverage higher-order context (lexicon, morphology, syntax, semantics). We validate a novel approach which models error rates as a function of relative textual predictability, and yields a single number, $k$, which measures the effect of textual predictability on the recognizer. We use this method to demonstrate that a Wav2Vec 2.0-based model makes stronger use of textual context than a hybrid ASR model, in spite of not using an explicit language model, and also use it to shed light on recent results demonstrating poor performance of standard ASR systems on African-American English. We demonstrate that these mostly represent failures of acoustic-phonetic modelling. We show how this approach can be used straightforwardly in diagnosing and improving ASR.
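The abstract does not state the functional form that links error rates to textual predictability. Purely as an illustration of how a single number $k$ could summarise that effect, the sketch below assumes a power-law relation in which the error rate on more predictable text equals the error rate on less predictable text raised to the power $k$. This form, and the function names `estimate_k` and `predicted_error`, are assumptions made here, not the paper's stated model.

```python
import math


def estimate_k(err_low_context, err_high_context):
    """Solve err_high_context = err_low_context ** k for k.

    Both inputs are error rates strictly between 0 and 1. Under this
    assumed relation, k > 1 means added textual context reduces errors,
    and k = 1 means context has no effect.
    """
    for err in (err_low_context, err_high_context):
        if not 0.0 < err < 1.0:
            raise ValueError("error rates must lie strictly between 0 and 1")
    return math.log(err_high_context) / math.log(err_low_context)


def predicted_error(err_low_context, k):
    """Error rate expected with context, under the assumed power law."""
    return err_low_context ** k
```

Under this assumption, two recognizers with the same low-context error rate can be compared directly: the one with the larger $k$ benefits more from predictable text.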