Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis