Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA
