When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models