CAVIAR: Categorical-Variable Embeddings for Accurate and Robust Inference
Mukherjee, Anirban, Chang, Hannah Hanwen
arXiv.org Artificial Intelligence
Social science research often hinges on the relationship between categorical variables and outcomes. We introduce CAVIAR, a novel method for embedding categorical variables that assume values in a high-dimensional ambient space but are sampled from an underlying manifold. Our theoretical and numerical analyses outline challenges posed by such categorical variables in causal inference. Specifically, dynamically varying and sparse levels can lead to violations of the Donsker conditions and a failure of the estimation functionals to converge to a tight Gaussian process. Traditional approaches, including the exclusion of rare categorical levels and principled variable selection models like LASSO, fall short. CAVIAR embeds the data into a lower-dimensional global coordinate system. The mapping can be derived from both structured and unstructured data, and ensures stable and robust estimates through dimensionality reduction. In a dataset of direct-to-consumer apparel sales, we illustrate how high-dimensional categorical variables, such as zip codes, can be succinctly represented, facilitating inference and analysis.
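The abstract's central idea, replacing a sparse high-dimensional categorical variable with a small set of coordinates, can be illustrated with a toy sketch. This is not CAVIAR's algorithm, which is not reproduced here. The zip codes, the made-up 2-D coordinates in `coords`, and the helper names `one_hot` and `embed` are all hypothetical and chosen only for illustration. The sketch compares the width of a one-hot design matrix with the width of an embedded one, and flags the rare levels that make indicator columns nearly empty.

```python
from collections import Counter

# Hypothetical observations: one zip code per sale (toy values, not from the paper).
zip_codes = ["94103", "94103", "94110", "10001", "10001", "10001", "96801"]

# Hypothetical side information: made-up 2-D coordinates per level.
# CAVIAR's own mapping is not reproduced here; this table is an assumption.
coords = {
    "94103": (37.77, -122.41),
    "94110": (37.75, -122.42),
    "10001": (40.75, -73.99),
    "96801": (21.31, -157.86),
}

def one_hot(levels, values):
    """Return one indicator column per level; width grows with the level count."""
    index = {lvl: i for i, lvl in enumerate(levels)}
    rows = []
    for v in values:
        row = [0] * len(levels)
        row[index[v]] = 1
        rows.append(row)
    return rows

def embed(values, table):
    """Replace each level with its low-dimensional coordinates."""
    return [list(table[v]) for v in values]

levels = sorted(set(zip_codes))
X_onehot = one_hot(levels, zip_codes)
X_embed = embed(zip_codes, coords)

# Levels seen only once yield indicator columns with a single nonzero entry.
counts = Counter(zip_codes)
rare_levels = sorted(lvl for lvl, n in counts.items() if n == 1)

print(len(X_onehot[0]), len(X_embed[0]), rare_levels)
```

In this toy setup the one-hot matrix needs one column per distinct zip code, while the embedded matrix keeps a fixed width of two regardless of how many levels appear. Nearby levels also receive similar coordinates, so rare levels share information with their neighbours instead of being estimated in isolation.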
Apr-11-2024
- Country:
  - Asia
    - Singapore (0.04)
    - Vietnam > Long An Province (0.04)
  - North America > United States
    - California
      - Alameda County > Berkeley (0.04)
      - Los Angeles County > Los Angeles (0.04)
      - San Francisco County > San Francisco (0.14)
      - Ventura County > Oxnard (0.04)
    - District of Columbia > Washington (0.04)
    - New Jersey > Mercer County
      - Princeton (0.04)
    - Illinois > Cook County
      - Chicago (0.04)
    - Hawaii (0.04)
    - New York
      - Erie County > Buffalo (0.04)
      - New York County
        - Manhattan (0.04)
        - New York City (0.04)
      - Tompkins County > Ithaca (0.04)
    - Florida
      - Miami-Dade County > Miami (0.04)
      - Monroe County > Key West (0.04)
    - Maine (0.04)
    - Alaska (0.04)
- Genre:
  - Research Report > New Finding (1.00)
- Industry:
  - Education (1.00)
  - Health & Medicine > Therapeutic Area (0.68)
- Technology: