Supplementary File for ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Evaluation Capability for Large Vision-Language Models

Neural Information Processing Systems 

We calculate the agreement between human judgment and our automatic evaluation (i.e., ConvBenchEval()) and find that it reaches 81.83% (see Tables 3-6 for the detailed agreement at each turn and overall). This demonstrates the effectiveness of ConvBenchEval(), which uses ChatGPT as the judge. The agreement between ChatGPT and GPT4 is very high at 87.38%, indicating that the choice of LLM judge only slightly influences the evaluation results. ConvBenchEval() armed with ChatGPT is therefore reliable and low-cost. From the above tables, we also observe that although GPT4V is expensive and can perceive images, its judgment performs worse than GPT4's.
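The agreement rate above can be sketched as the fraction of samples on which two judges produce the same verdict. This is a minimal illustration only; the verdict labels and the helper name are assumptions, not the actual ConvBench implementation.

```python
# Minimal sketch of an agreement-rate computation between two judges.
# The label set ("win"/"tie"/"lose") and function name are hypothetical.

def agreement_rate(judgments_a, judgments_b):
    """Percentage of samples on which two judges give the same verdict."""
    assert len(judgments_a) == len(judgments_b) and judgments_a
    matches = sum(a == b for a, b in zip(judgments_a, judgments_b))
    return 100.0 * matches / len(judgments_a)

# Hypothetical per-sample verdicts from a human judge and an automatic judge.
human = ["win", "tie", "lose", "win", "win"]
auto  = ["win", "tie", "win",  "win", "win"]

print(f"{agreement_rate(human, auto):.2f}%")  # 80.00% on this toy data
```

In the paper's setting, the same computation would be run once with human vs. ConvBenchEval() verdicts (81.83%) and once with ChatGPT vs. GPT4 verdicts (87.38%).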
