Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction
Guichoux, Téo, Lemerle, Théodor, Mehta, Shivam, Beskow, Jonas, Henter, Gustav Eje, Soulier, Laure, Pelachaud, Catherine, Obin, Nicolas
–arXiv.org Artificial Intelligence
Early approaches used autoregressive sequence modeling to map speech or text to motion sequences [19, 9], while diffusion-based generators now dominate for their ability to produce detailed, temporally consistent, and natural gestures [12, 10]. Other works explore discrete motion representations, enabling more controllable synthesis [8]. These models accept either speech or text as input and typically rely on speaker embeddings for multi-speaker modeling, which limits their ability to generalize to speakers unseen during training. In contrast, Gelina generates both speech and gestures directly from text, and can also clone voice and gestural style through sequence continuation from a speech-gesture prompt, without relying on speaker embeddings.

Text-to-speech approaches: Lately, TTS has shifted toward data-driven methods, with notable advances in discrete code modeling [4, 5, 6].
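The interleaved-token idea described above can be sketched as follows: per-frame speech and gesture codebook tokens are merged into a single stream so one autoregressive model predicts both modalities, and "cloning" amounts to continuing a sequence that begins with a speech-gesture prompt. This is a minimal illustrative sketch, not Gelina's actual implementation; all function names (`interleave`, `toy_predict`, `continue_sequence`) and the trivial predictor are assumptions made for the example.

```python
def interleave(speech_tokens, gesture_tokens):
    """Merge per-frame speech and gesture tokens into one stream
    (hypothetical layout; real systems may use other interleaving schemes)."""
    assert len(speech_tokens) == len(gesture_tokens)
    out = []
    for s, g in zip(speech_tokens, gesture_tokens):
        out.extend([("speech", s), ("gesture", g)])
    return out

def toy_predict(context):
    """Stand-in for the autoregressive model: echoes the most recent token
    of the modality due next. A trained model would instead sample from a
    learned next-token distribution conditioned on the full context."""
    kind = "speech" if context[-1][0] == "gesture" else "gesture"
    for k, v in reversed(context):
        if k == kind:
            return (kind, v)
    return (kind, 0)

def continue_sequence(prompt, n_steps):
    """Voice/gesture-style cloning via sequence continuation: the prompt's
    interleaved tokens condition every subsequent prediction, so no
    speaker embedding is needed."""
    seq = list(prompt)
    for _ in range(n_steps):
        seq.append(toy_predict(seq))
    return seq

prompt = interleave([11, 12], [7, 8])   # a short speech-gesture prompt
seq = continue_sequence(prompt, 4)      # extends both modalities in lockstep
```

Because prediction alternates modalities within one sequence, the model keeps speech and gesture temporally aligned by construction; conditioning on the prompt replaces the explicit speaker embedding used in earlier systems.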
Dec-1-2025