Supplementary File for ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Evaluation Capability for Large Vision-Language Models

Neural Information Processing Systems 

We calculate the agreement between human judgment and our automatic evaluation (i.e., ConvBenchEval()) and find that it reaches 81.83% (see Tables 3-6 for the detailed agreement at each turn and overall). This demonstrates the effectiveness of ConvBenchEval(), which uses ChatGPT as the judge. The agreement between ChatGPT and GPT4 is very high at 87.38%, indicating that the choice of LLM judge only slightly influences the evaluation results. ConvBenchEval() armed with ChatGPT is therefore reliable and low-cost. From the above tables, we also observe that although GPT4V is expensive and can perceive images, its judgment performs worse than GPT4's.
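The agreement rate above can be sketched as the fraction of samples on which two judges produce the same verdict. This is a minimal illustration only; the verdict labels and the helper name are assumptions, not the actual ConvBench implementation.

```python
# Minimal sketch of an agreement-rate computation between two judges.
# The label set ("win"/"tie"/"lose") and function name are hypothetical.

def agreement_rate(judgments_a, judgments_b):
    """Percentage of samples on which two judges give the same verdict."""
    assert len(judgments_a) == len(judgments_b) and judgments_a
    matches = sum(a == b for a, b in zip(judgments_a, judgments_b))
    return 100.0 * matches / len(judgments_a)

# Hypothetical per-sample verdicts from a human judge and an automatic judge.
human = ["win", "tie", "lose", "win", "win"]
auto  = ["win", "tie", "win",  "win", "win"]

print(f"{agreement_rate(human, auto):.2f}%")  # 80.00% on this toy data
```

In the paper's setting, the same computation would be run once with human vs. ConvBenchEval() verdicts (81.83%) and once with ChatGPT vs. GPT4 verdicts (87.38%).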
