High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models