Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction
Guichoux, Téo, Lemerle, Théodor, Mehta, Shivam, Beskow, Jonas, Henter, Gustav Eje, Soulier, Laure, Pelachaud, Catherine, Obin, Nicolas
–arXiv.org Artificial Intelligence
Early approaches used autoregressive sequence modeling to map speech or text to motion sequences [19, 9], while diffusion-based generators now dominate for their ability to produce detailed, temporally consistent, and natural gestures [12, 10]. Other works explore discrete motion representations, enabling more controllable synthesis [8]. These models accept either speech or text as input and typically rely on speaker embeddings for multi-speaker modeling, which limits their ability to generalize to speakers unseen during training. In contrast, Gelina generates both speech and gestures directly from text, and can also clone voice and gestural style through sequence continuation from a speech-gesture prompt, without relying on speaker embeddings.

Text-to-speech approaches: Lately, TTS has shifted toward data-driven methods, with notable advances in discrete code modeling [4, 5, 6].
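The interleaved-token idea described above can be sketched as follows: per-frame speech and gesture codebook tokens are merged into a single stream so one autoregressive model predicts both modalities, and "cloning" amounts to continuing a sequence that begins with a speech-gesture prompt. This is a minimal illustrative sketch, not Gelina's actual implementation; all function names (`interleave`, `toy_predict`, `continue_sequence`) and the trivial predictor are assumptions made for the example.

```python
def interleave(speech_tokens, gesture_tokens):
    """Merge per-frame speech and gesture tokens into one stream
    (hypothetical layout; real systems may use other interleaving schemes)."""
    assert len(speech_tokens) == len(gesture_tokens)
    out = []
    for s, g in zip(speech_tokens, gesture_tokens):
        out.extend([("speech", s), ("gesture", g)])
    return out

def toy_predict(context):
    """Stand-in for the autoregressive model: echoes the most recent token
    of the modality due next. A trained model would instead sample from a
    learned next-token distribution conditioned on the full context."""
    kind = "speech" if context[-1][0] == "gesture" else "gesture"
    for k, v in reversed(context):
        if k == kind:
            return (kind, v)
    return (kind, 0)

def continue_sequence(prompt, n_steps):
    """Voice/gesture-style cloning via sequence continuation: the prompt's
    interleaved tokens condition every subsequent prediction, so no
    speaker embedding is needed."""
    seq = list(prompt)
    for _ in range(n_steps):
        seq.append(toy_predict(seq))
    return seq

prompt = interleave([11, 12], [7, 8])   # a short speech-gesture prompt
seq = continue_sequence(prompt, 4)      # extends both modalities in lockstep
```

Because prediction alternates modalities within one sequence, the model keeps speech and gesture temporally aligned by construction; conditioning on the prompt replaces the explicit speaker embedding used in earlier systems.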
Dec-1-2025