Towards Robust FastSpeech 2 by Modelling Residual Multimodality