An Empirical Study on End-to-End Singing Voice Synthesis with Encoder-Decoder Architectures