Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis