TabText: A Flexible and Contextual Approach to Tabular Data Representation

Carballo, Kimberly Villalobos, Na, Liangyuan, Ma, Yu, Boussioux, Léonard, Zeng, Cynthia, Soenksen, Luis R., Bertsimas, Dimitris

Jul-21-2023–arXiv.org Artificial Intelligence

Tabular data remains the most widely used and readily available data format across fields ranging from education and healthcare to technology, where it plays a vital role in capturing information across domains. Preprocessing tabular data accurately and efficiently is essential for building reliable downstream models in machine learning applications. Yet directly incorporating tabular data into modeling pipelines has two significant limitations: it requires labor-intensive, often manual, data processing to standardize information across heterogeneous tabular structures and data sources, and it ignores contextual information such as column headers and meta content descriptions. In contrast to tabular approaches, language is a very flexible data modality that can easily represent information about different data points without imposing any structural similarity between them. Furthermore, recent developments in off-the-shelf large language models (LLMs) based on the Transformer architecture (Vaswani et al., 2017) offer state-of-the-art performance on a wide range of language tasks, including translation, sentence completion, and question answering. These pre-trained models are often developed with very large and diverse data sets, allowing them to exploit prior knowledge and make accurate predictions with very few new training samples.
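To make the contrast with rigid tabular schemas concrete, the following is a minimal sketch of how rows with different columns might be written out as text while keeping their column headers as context. It is an illustration under stated assumptions, not the TabText implementation: the function name `row_to_text`, the "column is value" template, the `missing` placeholder, and the two example records are all hypothetical.

```python
from typing import Any, Mapping


def row_to_text(row: Mapping[str, Any], missing: str = "unknown") -> str:
    """Serialize one table row as a sentence-like string.

    Hypothetical helper for illustration only; the template is an assumption.
    """
    parts = []
    for column, value in row.items():
        # Keep the column header as context and spell out missing values.
        shown = missing if value is None else str(value)
        parts.append(f"{column} is {shown}")
    return "; ".join(parts) + "."


# Two made-up records from sources with different schemas.
clinic_a = {"age": 54, "systolic blood pressure": 130}
clinic_b = {"patient age": 61, "smoker": None, "heart rate": 72}

# No shared column layout is needed: each row becomes its own string.
sentences = [row_to_text(clinic_a), row_to_text(clinic_b)]
for sentence in sentences:
    print(sentence)
```

Because each row is serialized independently, the two records need no common schema before they are passed to a language model, which is the flexibility the paragraph above describes.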


