Chain-of-Thought Reasoning is a Policy Improvement Operator