Analyzing Similarity Metrics for Data Selection for Language Model Pretraining