Do Pre-trained Vision-Language Models Encode Object States?