Ask, Pose, Unite: Scaling Data Acquisition for Close Interactions with Vision Language Models