Robust Cross-modal Alignment Learning for Cross-Scene Spatial Reasoning and Grounding
