Vision Language Models See What You Want but not What You See