Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?

Open in new window