GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions