Harnessing Diversity for Important Data Selection in Pretraining Large Language Models