Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?