Appendix of GLIPv2: Unifying Localization and Vision-Language Understanding