A Appendix
Neural Information Processing Systems
A.1 Creation of the Multimodal Web Document Dataset

A.1.1 Collection of a Large Number of HTML Files

Our data collection process begins with the 25 most recent Common Crawl snapshots, which contain webpages spanning from February 2020 to January/February 2023. This process yields a total of 41.2 billion documents.

Selection of English content To identify non-English content, we apply the FastText classifier (Joulin et al., 2017) to the extracted text, effectively filtering out 63.6% of the documents.

Early text deduplication Often, a set of URLs is crawled repeatedly across different Common Crawl snapshots. However, the content of these websites may vary as web administrators make changes over time. Hence, at this stage, we refrain from deduplicating documents based on their URLs. Instead, we perform MinHash (Broder, 1997) deduplication with 16 hashes calculated over 5-grams. To further refine the data, we eliminate documents containing substantial proportions of repeated paragraphs and n-grams, employing the methodology described in MassiveText (Rae et al., 2022).
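The MinHash deduplication step above can be sketched in a few lines. The snippet below is a minimal illustration, not the pipeline's actual implementation (which would rely on an optimized library and locality-sensitive hashing at this scale); the function names and the use of MD5 as the seeded hash family are our own assumptions. It mirrors the stated configuration: 16 hash functions computed over word 5-grams, with signature agreement approximating Jaccard similarity between documents.

```python
import hashlib
from typing import List, Set

NUM_HASHES = 16  # number of MinHash functions, matching the configuration above
NGRAM = 5        # shingle size: word 5-grams

def shingles(text: str, n: int = NGRAM) -> Set[str]:
    """Word-level n-grams forming the document's shingle set."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(text: str, num_hashes: int = NUM_HASHES) -> List[int]:
    """For each seeded hash function, keep the minimum hash over all shingles.
    (Assumes the text has at least NGRAM words, so the shingle set is non-empty.)"""
    doc_shingles = shingles(text)
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in doc_shingles)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching signature slots estimates the Jaccard similarity
    of the underlying shingle sets; near-duplicates score close to 1.0."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Documents whose estimated similarity exceeds a chosen threshold would then be clustered and all but one representative discarded. In practice, comparing signatures pairwise is quadratic in the number of documents, which is why production pipelines bucket signatures with locality-sensitive hashing instead.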