Appendix Table of Contents

Neural Information Processing Systems 

Starting from Grobid's XML output, peS2o filters out papers that are too short, have incorrect metadata, are in languages other than English, or contain OCR errors, using a combination of heuristic- and model-based filtering steps. We refer the reader to the datasheet and code for more details on this processing pipeline.

The subset of peS2o included in the Common Pile starts from v3 of the corpus, which contains documents from January 1, 1970 to October 6, 2024. We retain full-text papers with CC BY, CC BY-SA, or CC0 licenses, or that have been labeled as public domain; metadata is provided by the Semantic Scholar APIs [85]. After filtering, this set contains 6.3 million papers, or 35.7 billion whitespace-separated segments.
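As an illustration of the license filter and the segment count described above, the following is a minimal sketch and not the actual peS2o or Common Pile code. The record layout, the field names `license` and `text`, and the license label strings are all assumptions made for this example; the real metadata returned by the Semantic Scholar APIs may use different keys and values.

```python
# Hypothetical license labels; the real metadata values may be spelled differently.
ALLOWED_LICENSES = {"CC BY", "CC BY-SA", "CC0", "public domain"}


def is_openly_licensed(record: dict) -> bool:
    """Return True if the record's (assumed) 'license' field is in the allowed set."""
    return record.get("license") in ALLOWED_LICENSES


def count_segments(text: str) -> int:
    """Count whitespace-separated segments; runs of whitespace count as one separator."""
    return len(text.split())


# Toy records standing in for full-text papers (illustrative only).
records = [
    {"license": "CC BY", "text": "Open access  paper text."},
    {"license": "CC BY-NC", "text": "Non-commercial license."},
    {"license": None, "text": "Unknown license."},
]

kept = [r for r in records if is_openly_licensed(r)]
total_segments = sum(count_segments(r["text"]) for r in kept)
```

Here `str.split()` with no arguments splits on any run of whitespace, which is one plausible reading of "whitespace-separated segments"; the exact counting procedure used for the reported figure is not specified in the text.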
