Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction

Open in new window