DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer