MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs
Sirdeshmukh, Ved, Deshpande, Kaustubh, Mols, Johannes, Jin, Lifeng, Cardona, Ed-Yeremai, Lee, Dean, Kritz, Jeremy, Primack, Willow, Yue, Summer, Xing, Chen
arXiv.org Artificial Intelligence
We present MultiChallenge, a pioneering benchmark that evaluates large language models (LLMs) on conducting multi-turn conversations with human users, a crucial yet underexamined capability for their applications. MultiChallenge identifies four categories of challenges in multi-turn conversations that are not only common and realistic in current human-LLM interactions but also difficult for all current frontier LLMs. All four challenges require accurate instruction-following, context allocation, and in-context reasoning at the same time. We also develop an LLM-as-judge approach with instance-level rubrics, which enables automatic evaluation with fair agreement with experienced human raters. Despite achieving near-perfect scores on existing multi-turn evaluation benchmarks, all frontier models score below 50% accuracy on MultiChallenge, with the top-performing Claude 3.5 Sonnet (June 2024) achieving just 41.4% average accuracy.
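The abstract mentions an LLM judge guided by instance-level rubrics and an average accuracy. A minimal sketch of how such a score could be computed is shown below. It is an illustration under stated assumptions, not the authors' implementation: the `Instance` fields, the `respond` and `judge` callables, the yes/no rubric format, and the choice to average accuracy over categories are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple


@dataclass
class Instance:
    """One test conversation (all field names are hypothetical)."""
    category: str         # which challenge category the instance belongs to
    conversation: str     # the multi-turn history shown to the model
    rubric: str           # instance-level question posed to the judge
    passing_verdict: str  # verdict ("yes" or "no") that counts as a pass


def score(
    instances: Iterable[Instance],
    respond: Callable[[str], str],     # model under test: conversation -> reply
    judge: Callable[[str, str], str],  # judge: (rubric, reply) -> "yes" / "no"
) -> Tuple[Dict[str, float], float]:
    """Return per-category accuracy and their unweighted mean.

    Averaging over categories (rather than over instances) is an
    assumption made for this sketch.
    """
    outcomes = defaultdict(list)
    for inst in instances:
        reply = respond(inst.conversation)
        verdict = judge(inst.rubric, reply).strip().lower()
        outcomes[inst.category].append(verdict == inst.passing_verdict)

    if not outcomes:
        return {}, 0.0

    per_category = {cat: sum(v) / len(v) for cat, v in outcomes.items()}
    average = sum(per_category.values()) / len(per_category)
    return per_category, average
```

In practice `respond` and `judge` would wrap calls to the evaluated model and to the judge model. Keeping them as plain callables lets the scoring logic be checked with stubs before any model is involved.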
Jan-28-2025