Sheetpedia: A300K-Spreadsheet Corpus for Spreadsheet Intelligence and LLMFine-Tuning
–Neural Information Processing Systems
Spreadsheets are widely used for data analysis and reporting, yet their complex structure and formula logic pose significant challenges for AI systems. We introduce Sheetpedia, a large-scale corpus of over 290,000 diverse spreadsheets (from 324,000+ workbooks) compiled from enterprise email archives and online forums. We detail a rigorous collection and preprocessing pipeline (integrating the Enron email spreadsheet archive and the Fuse web corpus, plus a new crawl of Excel forums) to standardize formats, filter languages, and remove duplicates. Sheetpedia provides extensive coverage of real formulas and annotations - addressing a gap left by prior table datasets (e.g.
Neural Information Processing Systems
Jun-19-2026, 19:19:47 GMT
- Country:
- Europe (0.67)
- North America > United States (0.28)
- Genre:
- Research Report
- Experimental Study (1.00)
- New Finding (0.67)
- Research Report
- Industry:
- Information Technology (1.00)
- Technology: