Learning to Maximize Mutual Information for Chain-of-Thought Distillation