Learning Street View Representations with Spatiotemporal Contrast