Learning Control Policies for Stochastic Systems with Reach-avoid Guarantees