GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions
