Can Vision-Language Models Think from a First-Person Perspective?