Beyond Line-Level Filtering for the Pretraining Corpora of LLMs