Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization