Clarification as Supervision: Reinforcement Learning for Vision-Language Interfaces