Cleaner Pretraining Corpus Curation with Neural Web Scraping
Xu, Zhipeng, Liu, Zhenghao, Yan, Yukun, Liu, Zhiyuan, Yu, Ge, Xiong, Chenyan
– arXiv.org Artificial Intelligence
The web contains large-scale, diverse, and abundant information that can satisfy the information-seeking needs of humans. Through meticulous data collection, preprocessing, and curation, webpages can serve as a fundamental data resource for language model pretraining. However, faced with the increasingly complex and rapidly evolving structure of webpages, rule-based and feature-based web scrapers are becoming increasingly inadequate. This paper presents a simple, fast, and effective Neural web Scraper (NeuScraper) that extracts primary, clean text content from webpages. Experimental results show that NeuScraper surpasses baseline scrapers by more than 20%, demonstrating its potential for extracting higher-quality data to facilitate language model pretraining. All of the code is available at https://github.com/OpenMatch/NeuScraper.
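To make concrete the kind of rule-based scraping the abstract contrasts NeuScraper with, here is a minimal sketch of a heuristic extractor: it keeps text found inside content-bearing tags and drops script, style, and navigation boilerplate. The tag lists and the `extract_text` helper are illustrative assumptions, not the paper's actual baselines (which are not specified in this abstract).

```python
from html.parser import HTMLParser

# Illustrative tag sets (assumptions, not from the paper):
# keep text under these tags ...
CONTENT_TAGS = {"p", "h1", "h2", "h3", "li", "article"}
# ... and drop anything nested under these.
SKIP_TAGS = {"script", "style", "nav", "footer", "aside"}

class RuleBasedScraper(HTMLParser):
    """Toy rule-based extractor of the kind neural scrapers aim to replace."""

    def __init__(self):
        super().__init__()
        self.stack = []    # open-tag stack, to know the current context
        self.chunks = []   # collected text fragments

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Skip boilerplate regions entirely.
        if any(t in SKIP_TAGS for t in self.stack):
            return
        # Keep text only when it sits inside a content-bearing tag.
        if any(t in CONTENT_TAGS for t in self.stack):
            text = data.strip()
            if text:
                self.chunks.append(text)

def extract_text(html: str) -> str:
    scraper = RuleBasedScraper()
    scraper.feed(html)
    return "\n".join(scraper.chunks)

html = ("<html><body><nav>Home | About</nav>"
        "<p>Main article text.</p>"
        "<script>x = 1;</script></body></html>")
print(extract_text(html))  # → Main article text.
```

Such fixed heuristics break down on modern, heavily scripted pages, which is the gap a learned, neural extractor is meant to close.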
Jun-14-2024