Appendix Table of Contents
Neural Information Processing Systems
Starting from Grobid's XML output, peS2o filters papers that are too short, have incorrect metadata, are in languages other than English, and contain OCR errors using a combination of heuristic- and model-based filtering steps. We refer the reader to the datasheet and code for more details on this processing pipeline.

The subset of peS2o included in the Common Pile starts from v3 of the corpus, which contains documents from January 1, 1970 to October 6, 2024. We retain full-text papers with CC BY, CC BY-SA, or CC0 licenses, or that have been labeled as public domain; metadata is provided by the Semantic Scholar APIs [85]. After filtering, this set contains 6.3 million papers, or 35.7 billion whitespace-separated segments.
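The license-based retention step and the whitespace-separated count described above can be illustrated with a minimal sketch. This is not the peS2o or Common Pile code: the `Paper` record, the license label strings in `ALLOWED_LICENSES`, and the function names are all hypothetical, and the actual metadata values returned by the Semantic Scholar APIs may look different.

```python
from dataclasses import dataclass

# Hypothetical normalized license labels (an assumption for illustration);
# real metadata may encode these licenses differently.
ALLOWED_LICENSES = {"cc-by", "cc-by-sa", "cc0", "public-domain"}


@dataclass
class Paper:
    """Hypothetical record holding a paper's full text and license label."""
    text: str
    license: str


def is_retained(paper: Paper) -> bool:
    """Keep a paper only if its license label is in the allowed set."""
    return paper.license.strip().lower() in ALLOWED_LICENSES


def count_segments(text: str) -> int:
    """Count whitespace-separated segments in a text."""
    return len(text.split())


def filter_corpus(papers: list[Paper]) -> tuple[list[Paper], int]:
    """Return the retained papers and their total segment count."""
    kept = [p for p in papers if is_retained(p)]
    total = sum(count_segments(p.text) for p in kept)
    return kept, total


if __name__ == "__main__":
    sample = [
        Paper(text="Open access full text here", license="CC-BY"),
        Paper(text="All rights reserved text", license="proprietary"),
        Paper(text="Public domain report", license="public-domain"),
    ]
    kept, total = filter_corpus(sample)
    print(len(kept), total)
```

Normalizing the label with `strip().lower()` before the set lookup is a design choice in this sketch, so that variants such as `CC-BY` and `cc-by` are treated alike; how the real pipeline matches license strings is not stated in the text.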
Jun-17-2026, 07:36:41 GMT
- Country:
- North America > United States (1.00)
- Europe (1.00)
- Asia (1.00)
- Genre:
- Industry:
- Education (1.00)
- Law > Intellectual Property & Technology Law (0.93)
- Media > News (0.92)
- Health & Medicine (0.67)
- Government > Regional Government
- Technology: