Improving Region Representation Learning from Urban Imagery with Noisy Long-Caption Supervision