Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback