TabLib: A Dataset of 627M Tables with Context

Eggert, Gus; Huo, Kevin; Biven, Mike; Waugh, Justin

Oct-11-2023–arXiv.org Artificial Intelligence

It is well established that large, diverse datasets play a pivotal role in the performance of modern AI systems for text and image modalities. However, there are no datasets for tabular data of comparable size and diversity to those available for text and images. Thus we present "TabLib", a compilation of 627 million tables totaling 69 TiB, along with 867 billion tokens of context. TabLib was extracted from numerous file formats, including CSV, HTML, SQLite, PDF, Excel, and others, sourced from GitHub and Common Crawl. The size and diversity of TabLib offer considerable promise in the table modality, reminiscent of the original promise of foundational datasets for text and images, such as The Pile and LAION.

Country:
- North America
  - United States
    - Oregon (0.04)
    - North Dakota (0.04)
    - Nevada (0.04)
    - Maine (0.04)
    - Louisiana (0.04)
    - New York > New York County
      - New York City (0.14)
    - Massachusetts > Middlesex County
      - Cambridge (0.04)
    - Colorado > Boulder County
      - Boulder (0.04)
  - Canada
    - Quebec > Montreal (0.04)
    - British Columbia > Metro Vancouver Regional District
      - Vancouver (0.04)
- Asia
  - Middle East > Jordan (0.04)
  - Indonesia (0.04)

Genre:
- Research Report (0.67)

Technology:
- Information Technology
  - Software (1.00)
  - Information Management (1.00)
  - Communications > Web (0.68)
  - Data Science
    - Data Mining (1.00)
    - Data Quality (0.67)
  - Artificial Intelligence
    - Representation & Reasoning (1.00)
    - Natural Language > Large Language Model (0.93)
    - Machine Learning > Neural Networks
      - Deep Learning (1.00)
