Multi-rate attention architecture for fast streamable text-to-speech spectrum modeling