Delayed Attention Training Improves Length Generalization in Transformer–RNN Hybrids