A Appendix
Neural Information Processing Systems
A.1 Creation of the Multimodal Web Document Dataset

A.1.1 Collection of a Large Number of HTML Files

Our data collection process begins with the 25 most recent Common Crawl snapshots, which contain webpages spanning from February 2020 to January/February 2023. This process yields a total of 41.2 billion documents.

Selection of English content To identify non-English content, we apply the FastText classifier (Joulin et al., 2017) to the extracted text, effectively filtering out 63.6% of the documents.

Early text deduplication Often, a set of URLs is crawled repeatedly across different Common Crawl snapshots. However, the content of these websites may vary as web administrators make changes over time. Hence, at this stage, we refrain from deduplicating documents based on their URLs. Instead, we perform MinHash (Broder, 1997) deduplication with 16 hashes calculated over 5-grams. To further refine the data, we eliminate documents containing substantial proportions of repeated paragraphs and n-grams, employing the methodology described in MassiveText (Rae et al., 2022).
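The MinHash deduplication step above can be sketched in a few lines. The snippet below is a minimal illustration, not the pipeline's actual implementation (which would rely on an optimized library and locality-sensitive hashing at this scale); the function names and the use of MD5 as the seeded hash family are our own assumptions. It mirrors the stated configuration: 16 hash functions computed over word 5-grams, with signature agreement approximating Jaccard similarity between documents.

```python
import hashlib
from typing import List, Set

NUM_HASHES = 16  # number of MinHash functions, matching the configuration above
NGRAM = 5        # shingle size: word 5-grams

def shingles(text: str, n: int = NGRAM) -> Set[str]:
    """Word-level n-grams forming the document's shingle set."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(text: str, num_hashes: int = NUM_HASHES) -> List[int]:
    """For each seeded hash function, keep the minimum hash over all shingles.
    (Assumes the text has at least NGRAM words, so the shingle set is non-empty.)"""
    doc_shingles = shingles(text)
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in doc_shingles)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching signature slots estimates the Jaccard similarity
    of the underlying shingle sets; near-duplicates score close to 1.0."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Documents whose estimated similarity exceeds a chosen threshold would then be clustered and all but one representative discarded. In practice, comparing signatures pairwise is quadratic in the number of documents, which is why production pipelines bucket signatures with locality-sensitive hashing instead.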