MediTab: Scaling Medical Tabular Data Predictors via Data Consolidation, Enrichment, and Refinement
Wang, Zifeng, Gao, Chufan, Xiao, Cao, Sun, Jimeng
arXiv.org Artificial Intelligence
Tabular data prediction has been employed in medical applications such as patient health risk prediction. However, existing methods usually revolve around algorithm design while overlooking the significance of data engineering. As a result, previous predictors are often trained on small, manually curated datasets and struggle to generalize across different tabular datasets during inference. This paper proposes to scale medical tabular data predictors (MediTab) to various tabular inputs with varying features. The method uses a data engine that leverages large language models (LLMs) to consolidate tabular samples and so overcome the barrier between tables with distinct schemas. It also aligns out-domain data with the target task using a "learn, annotate, and refinement" pipeline. The expanded training data then enables the pre-trained MediTab to run inference on arbitrary tabular input in the domain without fine-tuning, resulting in significant improvements over supervised baselines: it reaches an average ranking of 1.57 and 1.00 on 7 patient outcome prediction datasets and 3 trial outcome prediction datasets, respectively. In addition, MediTab exhibits strong zero-shot performance: it outperforms supervised XGBoost models by 8.9% and 17.2% on average in two prediction tasks, respectively.

Tabular data are structured as tables or spreadsheets in a relational database. Each row in the table represents a data sample, while columns represent feature variables of different types, including categorical, numerical, binary, and textual features. Most previous papers focused on the model design of tabular predictors, mainly by (1) augmenting feature interactions via neural networks (Arik & Pfister, 2021), (2) improving tabular data representation learning by self-supervised pre-training (Yin et al., 2020; Yoon et al., 2020; Bahri et al., 2022), and (3) performing cross-tabular pre-training for transfer learning (Wang & Sun, 2022b; Zhu et al., 2023).
Tabular data predictors have also been employed in medicine, for example in patient health risk prediction (Wang & Sun, 2022b) and clinical trial outcome prediction (Fu et al., 2022). Additionally, LLMs have been shown to sample synthetic yet highly realistic tabular data (Borisov et al., 2022; Theodorou et al., 2023).
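To make the idea of consolidating rows from tables with distinct schemas more concrete, the sketch below turns each row into a plain-text description, so that samples with different columns end up in one shared text format. It is only an illustration under stated assumptions: the abstract says MediTab uses LLMs for consolidation, whereas this sketch swaps in a fixed template. The function `serialize_row`, the column names, and the example values are all hypothetical and do not come from the paper.

```python
from typing import Any, Mapping


def serialize_row(row: Mapping[str, Any]) -> str:
    """Describe one tabular sample as plain text.

    Hypothetical template-based stand-in for the LLM-based
    consolidation mentioned in the abstract; not the authors' method.
    """
    parts = []
    for column, value in row.items():
        if value is None:
            continue  # skip missing cells rather than writing "None"
        name = column.replace("_", " ")  # make column names readable
        if isinstance(value, bool):
            # binary features become yes/no
            parts.append(f"{name}: {'yes' if value else 'no'}")
        else:
            # numerical, categorical, and textual features are written as-is
            parts.append(f"{name}: {value}")
    return "; ".join(parts)


# Two made-up rows from tables with different schemas (illustrative only).
table_a_row = {"age": 63, "sex": "female", "tumor_stage": "II", "smoker": False}
table_b_row = {"patient_age": 58, "prior_chemotherapy": True, "ecog_score": None}

text_a = serialize_row(table_a_row)
text_b = serialize_row(table_b_row)
print(text_a)
print(text_b)
```

Because both rows are reduced to text, a single text-based predictor could in principle consume them without a shared column layout, which is the barrier the abstract describes.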
Oct-5-2023
- Country:
  - Europe > France (0.04)
  - North America > United States
    - New York > New York County
      - New York City (0.04)
    - Illinois > Champaign County
      - Urbana (0.04)
- Genre:
  - Research Report
    - Experimental Study (1.00)
    - New Finding (0.89)
- Industry:
  - Health & Medicine
    - Pharmaceuticals & Biotechnology (1.00)
    - Therapeutic Area
      - Oncology (1.00)
      - Immunology (0.70)
      - Obstetrics/Gynecology (0.69)
- Technology: