Finding Answers in Thought Matters: Revisiting Evaluation on Large Language Models with Reasoning
