From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models