WritingBench: A Comprehensive Benchmark for Generative Writing
Neural Information Processing Systems
Recent advances in large language models (LLMs) have significantly enhanced text generation capabilities, yet evaluating their performance in generative writing remains a challenge. Existing benchmarks focus primarily on generic text generation or cover only a limited range of writing tasks, failing to capture the diverse requirements of high-quality written content across domains. To bridge this gap, we present WritingBench, a comprehensive benchmark designed to evaluate LLMs across 6 core writing domains and 100 subdomains. We further propose a query-dependent evaluation framework that empowers LLMs to dynamically generate instance-specific assessment criteria. This framework is complemented by a fine-tuned critic model for criteria-aware scoring, enabling evaluation of style, format, and length. The framework's validity is further demonstrated by its data curation capability, which enables a 7B-parameter model to outperform GPT-4o in writing. We open-source the benchmark, along with evaluation tools and modular framework components, to advance the development of LLMs in writing.
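The evaluation idea above, in which criteria are generated for each query and then scored one by one by a critic, can be sketched as follows. This is a minimal illustration under stated assumptions, not the released WritingBench tooling: the `Criterion` type, the `evaluate` function, the 1 to 10 score scale, and mean aggregation are all hypothetical choices, and `toy_generator` / `toy_critic` are rule-based stand-ins for the LLM criteria generator and the fine-tuned critic model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    """One instance-specific assessment criterion (hypothetical structure)."""
    name: str
    description: str


# Hypothetical interfaces: an LLM would produce the criteria and a fine-tuned
# critic would score them. Plain callables keep the aggregation logic visible.
CriteriaGenerator = Callable[[str], List[Criterion]]
Critic = Callable[[str, str, Criterion], int]


def evaluate(query: str, response: str,
             generate_criteria: CriteriaGenerator,
             critic: Critic,
             max_score: int = 10) -> float:
    """Score a response against criteria generated for this specific query.

    Assumes each criterion is scored on a 1..max_score scale and that the
    final score is the plain mean; both are illustrative assumptions.
    """
    criteria = generate_criteria(query)
    if not criteria:
        raise ValueError("criteria generator returned no criteria")
    scores = []
    for criterion in criteria:
        score = critic(query, response, criterion)
        if not 1 <= score <= max_score:
            raise ValueError(f"score {score} out of range for {criterion.name}")
        scores.append(score)
    return sum(scores) / len(scores)


def toy_generator(query: str) -> List[Criterion]:
    """Stand-in for an LLM: adds a length criterion only if the query asks for one."""
    criteria = [Criterion("style", "Matches the requested tone")]
    if "words" in query:
        criteria.append(Criterion("length", "Respects the stated word limit"))
    return criteria


def toy_critic(query: str, response: str, criterion: Criterion) -> int:
    """Stand-in for a critic model: a crude word count for length, fixed score otherwise."""
    if criterion.name == "length":
        return 10 if len(response.split()) <= 50 else 2
    return 7
```

For example, `evaluate("Write a formal notice in under 50 words", "Dear residents, the water supply will be off on Monday.", toy_generator, toy_critic)` produces two criteria (style scored 7, length scored 10) and returns 8.5, while a query with no word limit yields only the style criterion. Passing the generator and critic in as callables keeps the sketch independent of any particular model or API.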
Jun-17-2026, 02:04:55 GMT