Efficient Pre-training for Localized Instruction Generation of Videos