Global Convergence in Training Large-Scale Transformers