CACTI: Leveraging Copy Masking and Contextual Information to Improve Tabular Data Imputation

Gorla, Aditya, Wang, Ryan, Liu, Zhengtong, An, Ulzee, Sankararaman, Sriram

Jun-4-2025–arXiv.org Machine Learning

We present CACTI, a masked autoencoding approach for imputing tabular data that leverages the structure in missingness patterns and contextual information. Our approach employs a novel median truncated copy masking training strategy that encourages the model to learn from empirical patterns of missingness, while incorporating semantic relationships between features (captured by column names and text descriptions) to better represent feature dependence. These dual sources of inductive bias enable CACTI to outperform state-of-the-art methods across a diverse range of datasets and missingness conditions, with an average $R^2$ gain of 7.8% over the next best method (13.4%, 6.1%, and 5.3% under missing not at random, missing at random, and missing completely at random, respectively). Our results highlight the value of leveraging dataset-specific contextual information and missingness patterns to enhance imputation performance.
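The abstract names the masking strategy without describing its mechanics, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that "copy masking" means hiding observed entries according to the missingness pattern of another row from the same dataset, and that "median truncated" means capping the number of visible entries per row at the batch median. The function names `copy_mask` and `median_truncate` are hypothetical.

```python
import numpy as np

def copy_mask(observed, rng):
    """Hide observed entries using missingness patterns copied from other rows.

    observed: bool array (n_rows, n_cols), True where a value is present.
    Returns a bool array of entries that remain visible to the model.
    Assumption: each row borrows the pattern of one randomly chosen donor row.
    """
    donors = rng.permutation(observed.shape[0])
    # An entry stays visible only if it is observed in both the row and its donor.
    return observed & observed[donors]

def median_truncate(visible, rng):
    """Cap the number of visible entries per row at the batch median.

    Assumption: rows above the median drop randomly chosen visible entries.
    """
    counts = visible.sum(axis=1)
    k = int(np.median(counts))
    out = visible.copy()
    for i in np.flatnonzero(counts > k):
        cols = np.flatnonzero(visible[i])
        drop = rng.choice(cols, size=int(counts[i] - k), replace=False)
        out[i, drop] = False
    return out

# Toy data: 8 rows, 6 features, roughly 30% of values missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 6))
X[rng.random(X.shape) < 0.3] = np.nan

observed = ~np.isnan(X)
visible = copy_mask(observed, rng)
encoder_input = median_truncate(visible, rng)
# Entries that are truly observed but hidden from the model can serve as
# reconstruction targets during training.
targets = observed & ~encoder_input
```

Under these assumptions, the training masks follow the missingness patterns that actually occur in the data instead of uniformly random ones, which is the inductive bias the abstract attributes to the strategy.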

artificial intelligence, machine learning, natural language


Country:
- Oceania > New Zealand (0.04)
- North America
  - Canada (0.04)
  - United States > California
    - Los Angeles County > Los Angeles (0.14)

Genre:
- Research Report
  - New Finding (1.00)
  - Experimental Study (0.93)

Industry:
- Health & Medicine (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language (1.00)
  - Machine Learning
    - Statistical Learning (1.00)
    - Neural Networks > Deep Learning (0.68)
