Towards Self-Refinement of Vision-Language Models with Triangular Consistency