e2cfb719f58585f779d0a4f9f07bd618-Supplemental-Datasets_and_Benchmarks.pdf
–Neural Information Processing Systems
A.1 Creation of the Multimodal Web Document Dataset

A.1.1 Collecting a Large Number of HTML Files

Our data collection process begins by considering the 25 most recent Common Crawl dumps available at the time of dataset creation. These dumps contain webpages spanning from February 2020 to January/February 2023. We use a modified version of readability-lxml to extract the main text from the pages, discarding any pages that contain text of excessively high perplexity. This process yields a total of 41.2 billion documents.

Selection of English content
To identify non-English content, we apply the FastText classifier (Joulin et al., 2017) to the extracted text, effectively filtering out 63.6% of the documents.

Early text deduplication
Often, a set of URLs is crawled repeatedly across different Common Crawl snapshots. However, the content of these websites may vary as web administrators make changes over time. Hence, at this stage, we refrain from deduplicating documents based on their URLs. Instead, we perform MinHash (Broder, 1997) deduplication with 16 hashes calculated over 5-grams. To further refine the data, we eliminate documents containing substantial proportions of repeated paragraphs and n-grams, employing the methodology described in MassiveText (Rae et al., 2022).
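The MinHash step described above can be illustrated with a minimal sketch. Only the 16 hashes and the 5-gram size come from the text. Everything else is an assumption made for illustration: the use of word-level (rather than character-level) 5-grams, the salted BLAKE2b hash family, the hypothetical 0.8 similarity threshold, and the quadratic pairwise comparison (a pipeline at this scale would more plausibly bucket signatures, for example with locality-sensitive hashing). It is not the authors' implementation.

```python
import hashlib

NUM_HASHES = 16  # from the text
NGRAM_SIZE = 5   # from the text; word-level n-grams are an assumption


def word_ngrams(text, n=NGRAM_SIZE):
    """Return the set of lowercase word n-grams (shingles) of a document."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < n:
        # Short documents contribute a single shingle made of all their words.
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def salted_hash(seed, shingle):
    """One member of the hash family: 64-bit BLAKE2b salted with the seed (assumed choice)."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_hashes=NUM_HASHES):
    """Signature = minimum hash value over all shingles, for each hash function."""
    shingles = word_ngrams(text)
    if not shingles:
        return None
    return tuple(
        min(salted_hash(seed, s) for s in shingles) for seed in range(num_hashes)
    )


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.8):
    """Keep the first document of each near-duplicate group (threshold is hypothetical)."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if sig is not None and any(
            estimated_jaccard(sig, other) >= threshold for other in kept_sigs
        ):
            continue  # near-duplicate of a document already kept
        kept.append(doc)
        if sig is not None:
            kept_sigs.append(sig)
    return kept
```

With only 16 hashes, the similarity estimate is coarse (it moves in steps of 1/16), which trades precision for a small signature per document. That trade-off is consistent with an early, inexpensive deduplication pass over billions of documents.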
Apr-30-2026, 02:17:07 GMT