Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction