Improving Spoken Language Modeling with Phoneme Classification: A Simple Fine-tuning Approach

Maxime Poli, Emmanuel Chemla, Emmanuel Dupoux

Oct-30-2024, arXiv.org Artificial Intelligence

Recent progress in Spoken Language Modeling has shown that learning language directly from speech is feasible. Generating speech through a pipeline that operates at the text level typically loses nuances, intonations, and non-verbal vocalizations. Modeling directly from speech opens the path to more natural and expressive systems. However, speech-only systems require up to three orders of magnitude more data to match the semantic abilities of their text-based counterparts. We show that fine-tuning speech representation models on phoneme classification leads to more context-invariant representations, and that language models trained on the resulting units achieve lexical comprehension comparable to that of models trained on a hundred times more data.
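
To make the phoneme classification objective concrete, here is a minimal sketch, not the authors' implementation. The abstract does not say which loss, encoder, or alignment is used, so this assumes a frame-level cross-entropy loss over aligned phoneme labels and a linear classification head. All names (`linear_head`, `frame_cross_entropy`, `sgd_step`) and the toy data are hypothetical. Real fine-tuning would also backpropagate into the speech encoder; here the frame vectors stand in for fixed encoder outputs and only the head is updated.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one frame's class scores.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear_head(frame, weights, bias):
    # weights holds one row per phoneme class (hypothetical layout).
    return [sum(w * x for w, x in zip(row, frame)) + b
            for row, b in zip(weights, bias)]

def frame_cross_entropy(frames, labels, weights, bias):
    # Mean negative log-likelihood of the aligned phoneme label per frame.
    total = 0.0
    for frame, label in zip(frames, labels):
        probs = softmax(linear_head(frame, weights, bias))
        total -= math.log(probs[label])
    return total / len(frames)

def sgd_step(frames, labels, weights, bias, lr=0.5):
    # One gradient step on the head only (assumption: encoder left fixed).
    n = len(frames)
    grad_w = [[0.0] * len(weights[0]) for _ in weights]
    grad_b = [0.0] * len(bias)
    for frame, label in zip(frames, labels):
        probs = softmax(linear_head(frame, weights, bias))
        for k, p in enumerate(probs):
            err = p - (1.0 if k == label else 0.0)
            grad_b[k] += err / n
            for j, x in enumerate(frame):
                grad_w[k][j] += err * x / n
    new_w = [[w - lr * g for w, g in zip(w_row, g_row)]
             for w_row, g_row in zip(weights, grad_w)]
    new_b = [b - lr * g for b, g in zip(bias, grad_b)]
    return new_w, new_b

# Toy data: four 2-dimensional "frames" with two phoneme classes.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
weights = [[0.0, 0.0], [0.0, 0.0]]
bias = [0.0, 0.0]

initial_loss = frame_cross_entropy(frames, labels, weights, bias)
for _ in range(20):
    weights, bias = sgd_step(frames, labels, weights, bias)
final_loss = frame_cross_entropy(frames, labels, weights, bias)
```

With an all-zero head the classifier is uniform over classes, so the starting loss equals the log of the number of classes. The loss then falls as the head learns to separate the two toy phonemes.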

artificial intelligence, machine learning, natural language

Country:
- North America
  - Mexico > Mexico City (0.14)
  - United States (0.46)

Genre:
- Research Report (0.40)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks (0.47)
  - Natural Language > Chatbot (0.63)
  - Speech (1.00)