Reasoning or Not? A Comprehensive Evaluation of Reasoning LLMs for Dialogue Summarization

Keyan Jin, Yapeng Wang, Leonel Santos, Tao Fang, Xu Yang, Sio Kei Im, Hugo Gonçalo Oliveira

arXiv.org Artificial Intelligence 

Dialogue summarization is a critical natural language processing task that supports numerous practical applications, such as customer service, meeting analysis, and conversational AI assistants. Unlike traditional document summarization, dialogue summarization must handle unique challenges, including multi-party interactions, fragmented utterances, ambiguous references, and frequent topic shifts. Additionally, effective summarization can facilitate automated meeting documentation, collaborative decision-making, and efficient information retrieval from dialogue records.

The field initially relied on extractive methods that selected key sentences using simple heuristics such as TF-IDF or word frequency (Marcu, 1997), before evolving to neural approaches such as Seq2Seq and Pointer-Generator networks, which enabled more fluent abstractive summaries (Rush et al., 2015; See et al., 2017). Subsequently, significant breakthroughs were achieved by adapting Transformer-based neural architectures to conversational settings (Lewis et al., 2019; Liang et al., 2022; Jin et al., 2025).

Large language models (LLMs) have achieved remarkable results across a wide variety of natural language processing tasks, including text classification, sentiment analysis, question answering, and translation, demonstrating strong generalization capabilities and state-of-the-art performance (Brown et al., 2020). In particular, reasoning LLMs, such as OpenAI-o1, DeepSeek-R1, and QwQ-32B, have exhibited notable advantages in tasks requiring complex reasoning, such as mathematical problem solving, logical inference, and machine translation (Chen et al., 2025a; Ye et al., 2025). These successes naturally prompt further exploration of their applicability to dialogue summarization.

Dialogue summarization encompasses multiple distinct paradigms, each reflecting real-world scenarios that vary significantly in language, domain, dialogue length, and user intent.
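The TF-IDF heuristic behind early extractive methods can be illustrated with a minimal sketch: each dialogue turn is treated as a "document", turns are scored by the mean TF-IDF weight of their words, and the top-scoring turns are returned in their original order. The function below is a hypothetical illustration of this general technique, not the method of any cited work.

```python
import math
import re
from collections import Counter

def tfidf_extract(sentences, k=1):
    """Illustrative TF-IDF extractive summarizer.

    Each sentence (dialogue turn) is treated as a document; sentences
    are ranked by the mean TF-IDF weight of their distinct words, and
    the top-k are returned in original order.
    """
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    n = len(sentences)
    # Document frequency: in how many sentences each word appears.
    df = Counter(w for toks in tokenized for w in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        # Mean TF-IDF over the sentence's distinct words.
        score = sum(
            (tf[w] / len(toks)) * math.log(n / df[w]) for w in tf
        ) / max(len(tf), 1)
        scores.append(score)
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Heuristics of this kind are cheap and transparent, but they copy utterances verbatim and therefore cannot resolve the fragmented, multi-speaker phrasing that abstractive approaches later addressed.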