DiscoTrack: A Multilingual LLM Benchmark for Discourse Tracking
Lanni Bu, Lauren Levine, Amir Zeldes
arXiv.org Artificial Intelligence
Recent LLM benchmarks have tested models on a range of phenomena, but are still focused primarily on natural language understanding for extraction of explicit information, such as QA or summarization, with responses often targeting information from individual sentences. We are still lacking more challenging, and importantly also multilingual, benchmarks focusing on implicit information and pragmatic inferences across larger documents in the context of discourse tracking: integrating and aggregating information across sentences, paragraphs and multiple speaker utterances. To this end, we present DiscoTrack, an LLM benchmark targeting a range of tasks across 12 languages and four levels of discourse understanding: salience recognition, entity tracking, discourse relations and bridging inference. Our evaluation shows that these tasks remain challenging, even for state-of-the-art models.
Nov-11-2025
- Genre:
- Research Report (1.00)