Data, Data Everywhere: A Guide for Pretraining Dataset Construction