TabText: A Flexible and Contextual Approach to Tabular Data Representation
Carballo, Kimberly Villalobos, Na, Liangyuan, Ma, Yu, Boussioux, Léonard, Zeng, Cynthia, Soenksen, Luis R., Bertsimas, Dimitris
arXiv.org Artificial Intelligence
Tabular data remains the most widely used and readily available data format across fields such as education, healthcare, and technology, where it plays a vital role in capturing information from every domain. Preprocessing tabular data accurately and efficiently is essential for building reliable downstream machine learning models. Yet two significant limitations arise when tabular data is incorporated directly into modeling pipelines: standardizing information across heterogeneous table structures and data sources requires labor-intensive, often manual, data processing, and contextual information such as column headers and meta-content descriptions is ignored. In contrast to tabular approaches, language is a very flexible data modality that can easily represent information about different data points without imposing any structural similarity between them. Furthermore, recent off-the-shelf large language models (LLMs) based on the Transformer architecture (Vaswani et al., 2017) offer state-of-the-art performance on a wide range of language tasks, including translation, sentence completion, and question answering. These pre-trained models are often developed on very large and diverse data sets, allowing them to exploit prior knowledge and make accurate predictions with very few new training samples.
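To make the contrast between rigid tables and flexible language concrete, the following is a minimal Python sketch of how rows from two structurally different tables could be turned into sentences that keep the column headers as context. It is an illustration only, not the TabText implementation: the function name `row_to_text`, the "label is value" template, and the example columns are all hypothetical.

```python
def row_to_text(row, descriptions=None):
    """Serialize one table row (a dict of column name -> value) into a sentence.

    Hypothetical helper for illustration; the sentence template is an assumption.
    """
    descriptions = descriptions or {}
    parts = []
    for column, value in row.items():
        if value is None:
            continue  # skip missing values instead of inventing text for them
        # Prefer a human-written column description; fall back to the header itself.
        label = descriptions.get(column, column.replace("_", " "))
        parts.append(f"{label} is {value}")
    return "; ".join(parts) + "." if parts else ""


# Two rows with unrelated schemas go through the same function,
# with no shared column layout required (example data is made up).
patient = {"age": 54, "blood_pressure": "120/80", "smoker": None}
student = {"grade_level": 9, "gpa": 3.7}

patient_text = row_to_text(patient, {"blood_pressure": "blood pressure reading"})
student_text = row_to_text(student)
print(patient_text)
print(student_text)
```

Under these assumptions, the resulting strings could be passed to any pre-trained language model to obtain embeddings, which is where the column-header context would come into play.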
Jul-21-2023
- Country:
  - North America > United States
    - Massachusetts > Middlesex County > Cambridge (0.05)
  - Europe
    - Switzerland (0.04)
    - Denmark (0.04)
    - Sweden > Vaestra Goetaland
      - Gothenburg (0.04)
- Genre:
  - Research Report (1.00)