Beyond Line-Level Filtering for the Pretraining Corpora of LLMs
Park, Chanwoo; Park, Suyoung; Ahn, Yelim; Kim, Jongmin; Park, Jongyeon; Lee, Jaejin
arXiv.org Artificial Intelligence
Traditional line-level filtering techniques, such as line-level deduplication and trailing-punctuation filters, are commonly used, but these basic methods can sometimes discard valuable content, negatively affecting downstream performance. In this paper, we introduce two methods, pattern-aware line-level deduplication (PLD) and pattern-aware trailing-punctuation filtering (PTF), that enhance the conventional filtering techniques. Our approach considers not only line-level signals but also their sequential distribution across documents, enabling us to retain structurally important content that might otherwise be removed. We evaluate the proposed methods by training small language models (1B parameters) in both English and Korean. The results demonstrate that our methods consistently improve performance on multiple-choice benchmarks and significantly enhance generative question-answering accuracy on both SQuAD v1 and KorQuAD v1.
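To make the contrast concrete, here is a minimal sketch of the two conventional filters the abstract names, followed by one possible sequence-aware variant. The abstract does not specify how PLD or PTF work, so `run_aware_punct_filter`, its `min_run` threshold, and the `TERMINALS` set are illustrative assumptions and not the authors' algorithm.

```python
# Terminal marks accepted by the conventional filter (an assumed set).
TERMINALS = (".", "!", "?", '"', "'")


def trailing_punct_filter(doc: str) -> str:
    """Conventional filter: keep only lines ending in terminal punctuation."""
    kept = [ln for ln in doc.splitlines() if ln.rstrip().endswith(TERMINALS)]
    return "\n".join(kept)


def line_dedup(docs: list[str]) -> list[str]:
    """Conventional line-level dedup: keep the first occurrence of each
    non-empty line across the whole corpus and drop later repeats."""
    seen: set[str] = set()
    out = []
    for doc in docs:
        kept = []
        for ln in doc.splitlines():
            key = ln.strip()
            if key and key in seen:
                continue
            seen.add(key)
            kept.append(ln)
        out.append("\n".join(kept))
    return out


def run_aware_punct_filter(doc: str, min_run: int = 3) -> str:
    """Hypothetical sequence-aware variant (NOT the paper's PTF): also keep
    unpunctuated lines when they form a run of at least `min_run`
    consecutive lines, on the guess that such runs are lists or tables."""
    lines = doc.splitlines()
    punctuated = [ln.rstrip().endswith(TERMINALS) for ln in lines]
    keep = list(punctuated)
    i = 0
    while i < len(lines):
        if punctuated[i]:
            i += 1
            continue
        j = i
        while j < len(lines) and not punctuated[j]:
            j += 1
        if j - i >= min_run:  # long enough run: treat it as structure
            for k in range(i, j):
                keep[k] = True
        i = j
    return "\n".join(ln for ln, k in zip(lines, keep) if k)
```

On a recipe-like document, the conventional filter would drop an unpunctuated ingredient list entirely, while the run-aware sketch keeps it and still drops an isolated trailing fragment such as a lone "Share" link.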
Oct-29-2025
- Country:
- Asia
- Middle East > Jordan (0.04)
- South Korea > Seoul
- Seoul (0.04)
- Europe
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- France (0.04)
- North America > United States
- California
- San Francisco County > San Francisco (0.04)
- Santa Clara County > Santa Clara (0.04)
- Colorado (0.04)
- New York (0.04)
- Virginia (0.04)
- Pacific Ocean > North Pacific Ocean
- San Francisco Bay (0.04)
- Genre:
- Research Report > New Finding (1.00)
- Industry:
- Education (1.00)
- Government > Regional Government (0.93)
- Leisure & Entertainment > Sports
- Football (1.00)
- Technology: