PORTAL: Scalable Tabular Foundation Models via Content-Specific Tokenization
Spinaci, Marco, Polewczyk, Marek, Hoffart, Johannes, Kohler, Markus C., Thelin, Sam, Klein, Tassilo
arXiv.org Artificial Intelligence
Self-supervised learning on tabular data seeks to transfer advances from the natural language and image domains to the heterogeneous world of tables. However, current techniques often struggle to integrate multi-domain data and require data cleaning or specific structural assumptions, limiting the scalability of pre-training datasets. We introduce PORTAL (Pretraining One-Row-at-a-Time for All tabLes), a framework that handles diverse data modalities without any cleaning or preprocessing. This simple yet powerful approach can be effectively pre-trained on datasets collected from the web and fine-tuned to match state-of-the-art methods on complex classification and regression tasks. This work offers a practical advancement in self-supervised learning for large-scale tabular data.
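To make the idea of row-wise, content-specific tokenization concrete, here is a minimal hypothetical sketch. All names and design choices below are illustrative assumptions, not PORTAL's actual implementation: the point is only that each cell is routed by its content type (number, text, missing) rather than forced through a single cleaning pipeline.

```python
# Hypothetical sketch of content-specific, row-at-a-time tokenization for
# mixed tabular data. Illustrative only; not the PORTAL implementation.
import math


def tokenize_cell(column, value):
    """Map one cell to a (column, kind, payload) token based on its content type."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return (column, "missing", None)        # missing values kept as-is, no cleaning
    if isinstance(value, (int, float)):
        return (column, "number", float(value)) # numeric content stays numeric
    return (column, "text", str(value).lower()) # free text goes to a text path


def tokenize_row(row):
    """Process one row at a time: each cell becomes its own content-typed token."""
    return [tokenize_cell(col, val) for col, val in row.items()]


row = {"age": 42, "city": "Dubrovnik", "income": None}
print(tokenize_row(row))
# [('age', 'number', 42.0), ('city', 'text', 'dubrovnik'), ('income', 'missing', None)]
```

Because each cell carries its own type tag, rows from unrelated tables with different schemas can flow through the same model without schema alignment or imputation.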
Oct-17-2024