What's in the Image? A Deep-Dive into the Vision of Vision Language Models