Agreeing to Interact in Human-Robot Interaction using Large Language Models and Vision Language Models