Finding Answers in Thought Matters: Revisiting Evaluation on Large Language Models with Reasoning