Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction
