Localization vs. Semantics: How Can Language Benefit Visual Representation Learning?