Text-Speech Language Models with Improved Cross-Modal Transfer by Aligning Abstraction Levels
Cuervo, Santiago, Moumen, Adel, Labrak, Yanis, Khurana, Sameer, Laurent, Antoine, Rouvier, Mickael, Marxer, Ricard
–arXiv.org Artificial Intelligence
Text-Speech Language Models (TSLMs) -- language models trained to jointly process and generate text and speech -- aim to enable cross-modal knowledge transfer to overcome the scaling limitations of unimodal speech LMs. The predominant approach to TSLM training expands the vocabulary of a pre-trained text LM by appending new embeddings and linear projections for speech, followed by fine-tuning on speech data. We hypothesize that this method limits cross-modal transfer by neglecting feature compositionality, preventing text-learned functions from being fully leveraged at appropriate abstraction levels. To address this, we propose augmenting vocabulary expansion with modules that better align abstraction levels across layers. Our models, \textsc{SmolTolk}, rival or surpass state-of-the-art TSLMs trained with orders of magnitude more compute. Representation analyses and improved multimodal performance suggest our method enhances cross-modal transfer.
arXiv.org Artificial Intelligence
Mar-8-2025
- Country:
- South America > Chile
- North America > United States
- Pennsylvania > Philadelphia County
- Philadelphia (0.04)
- Massachusetts > Middlesex County
- Cambridge (0.04)
- Florida > Miami-Dade County
- Miami (0.04)
- Pennsylvania > Philadelphia County
- Europe
- Germany > Berlin (0.04)
- United Kingdom > England
- Cambridgeshire > Cambridge (0.04)
- France > Provence-Alpes-Côte d'Azur
- Bouches-du-Rhône > Marseille (0.04)
- Asia
- Genre:
- Research Report > New Finding (1.00)
- Technology: