SENSE models: an open source solution for multilingual and multimodal semantic-based tasks
Mdhaffar, Salima, Elleuch, Haroun, Chellaf, Chaimae, Nguyen, Ha, Estève, Yannick
–arXiv.org Artificial Intelligence
Abstract--This paper introduces SENSE (Shared Embedding for N-lingual Speech and tExt), an open-source solution inspired by the SAMU-XLSR framework and conceptually similar to Meta AI's SONAR models. These approaches rely on a teacher-student framework to align a self-supervised speech encoder with the language-agnostic continuous representations of a text encoder at the utterance level. We describe how the original SAMU-XLSR method has been updated by selecting a stronger teacher text model and a better initial speech encoder . The source code for training and using SENSE models has been integrated into the SpeechBrain toolkit, and the first SENSE model we trained has been publicly released. We report experimental results on multilingual and multimodal semantic tasks, where our SENSE model achieves highly competitive performance. Finally, this study offers new insights into how semantics are captured in such semantically aligned speech encoders. Speech foundation models based on self-supervised learning (SSL) have brought significant advances in speech processing. These models, such as wav2vec 2.0 [1], HuBERT [2], and WavLM [3], generate learned speech representations that can be applied to a wide range of downstream speech processing tasks. By training on large amounts of unlabelled speech data, SSL models have demonstrated the ability to capture crucial speech features, such as phonemes and other acoustic units [4]. This capability has led to significant progress in multiple downstream tasks, including speech recognition [1], speech translation [5], speech separation, speaker verification, speaker diarization [3], and emotion detection [6]. Different approaches have been proposed to pretrain model by aligning speech and text, like mSLAM [7], a Massively multilingual joint pre-training for speech and text.
arXiv.org Artificial Intelligence
Dec-10-2025
- Country:
- Africa
- Mayotte (0.04)
- Middle East > Tunisia
- Tunis Governorate > Tunis (0.04)
- Europe
- Africa
- Genre:
- Research Report (1.00)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (1.00)
- Natural Language
- Machine Translation (1.00)
- Text Processing (1.00)
- Speech > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence