Toxicity of the Commons: Curating Open-Source Pre-Training Data