Viewing Vision-Language Integration as a Double-Grounding Case