Metis: A Foundation Speech Generation Model with Masked Generative Pre-training

Jun-14-2026, 02:38:00 GMT–Neural Information Processing Systems

Unlike previous task-specific or multi-task models, Metis follows a pre-training and fine-tuning paradigm. It is pre-trained on large-scale unlabeled speech data using masked generative modeling and then fine-tuned to adapt to diverse speech generation tasks. Specifically, Metis uses two discrete speech representations: SSL tokens derived from speech self-supervised learning (SSL) features, and acoustic tokens quantized directly from waveforms.
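The excerpt above does not spell out the masking procedure, so the following is only a minimal sketch of how a masked generative training example over discrete tokens is commonly built. The cosine mask-ratio schedule, the `MASK_ID` value, the `mask_tokens` helper, and the vocabulary size of 1024 are all assumptions made for illustration, not details taken from Metis.

```python
import math
import random

MASK_ID = -1  # hypothetical id for the mask token (assumed, not from the paper)


def mask_tokens(tokens, rng):
    """Replace a random subset of discrete tokens with MASK_ID.

    The mask ratio follows a cosine schedule, an assumption borrowed
    from common masked generative models.
    """
    if not tokens:
        return [], []
    u = rng.random()                       # u in [0, 1)
    ratio = math.cos(u * math.pi / 2)      # ratio in (0, 1]
    n_mask = max(1, round(ratio * len(tokens)))
    masked = set(rng.sample(range(len(tokens)), n_mask))
    # Model input: masked positions carry MASK_ID, the rest stay visible.
    inputs = [MASK_ID if i in masked else t for i, t in enumerate(tokens)]
    # Training targets: original token at masked positions, None elsewhere.
    targets = [t if i in masked else None for i, t in enumerate(tokens)]
    return inputs, targets


rng = random.Random(0)
# Stand-in for a sequence of SSL token ids; real tokens would come from a quantizer.
ssl_tokens = [rng.randrange(1024) for _ in range(12)]
inputs, targets = mask_tokens(ssl_tokens, rng)
```

In a setup like this, a model would be trained to predict the entries of `targets` that are not `None` from `inputs`; fine-tuning would add task-specific conditions alongside the masked sequence.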

artificial intelligence, machine learning, proceedings

Technology:
- Information Technology > Artificial Intelligence > Machine Learning (1.00)