Understanding and Improving In-Context Learning on Vision-language Models