Appendix Table of Contents

Jun-17-2026, 07:36:41 GMT–Neural Information Processing Systems

Starting from Grobid's XML output, peS2o filters out papers that are too short, have incorrect metadata, are in languages other than English, or contain OCR errors, using a combination of heuristic- and model-based filtering steps. We refer the reader to the datasheet and code for more details on this processing pipeline.

The subset of peS2o included in the Common Pile starts from v3 of the corpus, which contains documents from January 1, 1970 to October 6, 2024. We retain full-text papers with CC BY, CC BY-SA, or CC0 licenses, or that have been labeled as public domain; metadata is provided by the Semantic Scholar APIs [85]. After filtering, this set contains 6.3 million papers, or 35.7 billion whitespace-separated segments.
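The license criterion described above can be illustrated with a small filter. This is a minimal sketch, not the Common Pile or peS2o code: the record layout (a dict with `license` and `text` keys), the license label strings, and the function names are all assumptions made for this example, and the real Semantic Scholar metadata may use different fields and values.

```python
# Hypothetical license labels; the strings used in the real metadata may differ.
PERMITTED_LICENSES = {"cc-by", "cc-by-sa", "cc0", "public-domain"}


def normalize_license(label):
    """Lowercase a license label and unify separators (assumed normalization)."""
    if label is None:
        return ""
    return label.strip().lower().replace(" ", "-").replace("_", "-")


def keep_paper(paper):
    """Return True if the paper has full text and a permitted license."""
    has_text = bool(paper.get("text", "").strip())
    return has_text and normalize_license(paper.get("license")) in PERMITTED_LICENSES


def count_whitespace_segments(papers):
    """Count whitespace-separated segments across the given papers."""
    return sum(len(p["text"].split()) for p in papers)


# Toy records, invented for illustration only.
papers = [
    {"license": "CC BY", "text": "Openly licensed full text."},
    {"license": "CC BY-NC", "text": "Non-commercial license, dropped."},
    {"license": "CC0", "text": ""},  # no full text, dropped
    {"license": "public domain", "text": "Public domain text."},
]
retained = [p for p in papers if keep_paper(p)]
```

Counting segments with `str.split()` mirrors the whitespace-separated unit used to report corpus size; it is not a tokenizer count.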

large language model, machine learning, natural language

Country:
- North America > United States (1.00)
- Europe (1.00)
- Asia (1.00)

Genre:
- Instructional Material > Course Syllabus & Notes (0.67)

Industry:
- Education (1.00)
- Law > Intellectual Property & Technology Law (0.93)
- Media > News (0.92)
- Health & Medicine (0.67)
- Government > Regional Government
  - North America Government > United States Government (1.00)

Technology:
- Information Technology
  - Information Management (1.00)
  - Data Science (1.00)
  - Communications (1.00)
  - Artificial Intelligence
    - Representation & Reasoning (1.00)
    - Natural Language > Large Language Model (0.68)
    - Machine Learning > Neural Networks
      - Deep Learning (0.93)
