GLIPv2: Unifying Localization and Vision-Language Understanding