Cleaner Pretraining Corpus Curation with Neural Web Scraping