Supplementary File for ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Evaluation Capability for Large Vision-Language Models
–Neural Information Processing Systems
We calculate the agreement between human judgment and our automatic evaluation (i.e., ConvBenchEval()) and find that it reaches 81.83% (see Tables 3-6 for the detailed agreement on each turn and overall). This demonstrates the effectiveness of ConvBenchEval(), which uses ChatGPT. The agreement between ChatGPT and GPT4 is high, at 87.38%, which suggests that the choice of LLM judge only slightly influences the evaluation results. ConvBenchEval() armed with ChatGPT is therefore reliable and low-cost. From these tables, we also observe that although GPT4V is expensive and can take images as input, its judgment performs worse than GPT4's judgment.
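As a rough illustration of how an agreement figure of this kind can be computed, the sketch below counts the fraction of items on which two judges give the same verdict. It assumes simple percentage agreement over categorical verdicts. The function name `agreement_rate`, the verdict labels, and the sample data are hypothetical and do not come from the paper, whose exact protocol may differ.

```python
from typing import Sequence


def agreement_rate(judge_a: Sequence[str], judge_b: Sequence[str]) -> float:
    """Return the percentage of items on which two judges give the same verdict."""
    if len(judge_a) != len(judge_b):
        raise ValueError("both judges must label the same items")
    if not judge_a:
        raise ValueError("need at least one judged item")
    # Count positions where the two verdicts are identical.
    matches = sum(a == b for a, b in zip(judge_a, judge_b))
    return 100.0 * matches / len(judge_a)


# Hypothetical verdicts for five samples (not data from the paper).
human = ["win", "tie", "loss", "win", "loss"]
auto = ["win", "tie", "win", "win", "loss"]

print(f"agreement: {agreement_rate(human, auto):.2f}%")
```

The same function could be applied per turn or to the overall verdicts by passing the corresponding lists, and to any pair of judges (human vs. automatic, or one LLM judge vs. another).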
Oct-10-2025, 14:13:11 GMT