Analyzing Similarity Metrics for Data Selection for Language Model Pretraining
