Ko-MuSR: A Multistep Soft Reasoning Benchmark for LLMs Capable of Understanding Korean
Park, Chanwoo; Park, Suyoung; Kang, JiA; Park, Jongyeon; Kim, Sangho; Park, Hyunji M.; Bae, Sumin; Kang, Mingyu; Lee, Jaejin
arXiv.org Artificial Intelligence
We present Ko-MuSR, the first benchmark to comprehensively evaluate multistep, soft reasoning in long Korean narratives while minimizing data contamination. Built following MuSR, Ko-MuSR features fully Korean narratives, reasoning chains, and multiple-choice questions verified by human annotators for logical consistency and answerability. Evaluations of four large language models (two multilingual and two Korean-specialized) show that multilingual models outperform Korean-focused ones even in Korean reasoning tasks, indicating cross-lingual generalization of reasoning ability. Carefully designed prompting strategies, which combine few-shot examples, reasoning traces, and task-specific hints, further boost accuracy, approaching human-level performance. Ko-MuSR offers a solid foundation for advancing Korean NLP by enabling systematic evaluation of long-context reasoning and prompting strategies.
Oct-29-2025