Can Vision-Language Models Think from a First-Person Perspective?
