SGSimEval: A Comprehensive Multifaceted and Similarity-Enhanced Benchmark for Automatic Survey Generation Systems
Guo, Beichen, Wen, Zhiyuan, Yang, Yu, Gao, Peng, Yang, Ruosong, Shen, Jiaxing
arXiv.org Artificial Intelligence
The growing interest in automatic survey generation (ASG), a task that traditionally required considerable time and effort, has been spurred by recent advances in large language models (LLMs). With advances in retrieval-augmented generation (RAG) and the rising popularity of multi-agent systems (MASs), synthesizing academic surveys with LLMs has become a viable approach, raising the need for robust evaluation methods in this domain. However, existing evaluation methods suffer from several limitations, including biased metrics, a lack of human-preference considerations, and an over-reliance on LLMs as judges. To address these challenges, we propose SGSimEval, a comprehensive benchmark for Survey Generation with Similarity-Enhanced Evaluation. It evaluates automatic survey generation systems by integrating assessments of the outline, content, and references, and combines LLM-based scoring with quantitative metrics to provide a multifaceted evaluation framework. In SGSimEval, we also introduce human-preference metrics that emphasize both inherent quality and similarity to human-written surveys. Extensive experiments reveal that current ASG systems perform comparably to humans in outline generation but show significant room for improvement in content and reference generation, and that our evaluation metrics maintain strong consistency with human assessments.
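The abstract describes blending LLM-judge scores with quantitative similarity-to-human metrics. A minimal sketch of that idea, with an illustrative weighted blend and a simple bag-of-words cosine similarity, is shown below; the weighting scheme, similarity measure, and function names are assumptions for illustration, not SGSimEval's actual formulation.

```python
# Illustrative sketch: blend an LLM-judge quality score with a quantitative
# similarity-to-human term. The 0.5/0.5 weighting and bag-of-words cosine
# similarity are hypothetical choices, not the paper's actual metric.
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Bag-of-words cosine similarity between two texts (0.0 to 1.0)."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def combined_score(llm_score: float, generated: str, human_reference: str,
                   weight: float = 0.5) -> float:
    """Weighted blend of an LLM-judge score (in [0, 1]) and
    similarity of the generated survey to a human-written reference."""
    return (weight * llm_score
            + (1 - weight) * cosine_similarity(generated, human_reference))
```

Under this sketch, a generated outline identical to the human reference with a perfect judge score yields `combined_score(1.0, text, text) == 1.0`, while dissimilar text pulls the score toward the judge component alone.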
Aug-18-2025