Evaluating Vision-Language Models in the Wild with Human Preferences