CAVIAR: Categorical-Variable Embeddings for Accurate and Robust Inference

Mukherjee, Anirban; Chang, Hannah Hanwen

Apr-11-2024–arXiv.org Artificial Intelligence

Social science research often hinges on the relationship between categorical variables and outcomes. We introduce CAVIAR, a novel method for embedding categorical variables that assume values in a high-dimensional ambient space but are sampled from an underlying manifold. Our theoretical and numerical analyses outline challenges posed by such categorical variables in causal inference. Specifically, dynamically varying and sparse levels can lead to violations of the Donsker conditions and a failure of the estimation functionals to converge to a tight Gaussian process. Traditional approaches, including the exclusion of rare categorical levels and principled variable selection models like LASSO, fall short. CAVIAR embeds the data into a lower-dimensional global coordinate system. The mapping can be derived from both structured and unstructured data, and ensures stable and robust estimates through dimensionality reduction. In a dataset of direct-to-consumer apparel sales, we illustrate how high-dimensional categorical variables, such as zip codes, can be succinctly represented, facilitating inference and analysis.
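To make the dimensionality-reduction idea concrete, the following is a minimal sketch and not the CAVIAR procedure itself, which the abstract does not specify. It assumes each categorical level (for example, a zip code) has a vector of side information, and it swaps in ordinary PCA via SVD as a stand-in for the embedding. All names (`embed_levels`, `level_features`, `obs_levels`) and the synthetic data are hypothetical.

```python
import numpy as np


def embed_levels(level_features: np.ndarray, dim: int) -> np.ndarray:
    """Project per-level feature vectors onto their top `dim` principal directions.

    This is plain PCA, used here only as an illustrative stand-in embedding.
    """
    centered = level_features - level_features.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions of the centered feature matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dim].T


rng = np.random.default_rng(0)
n_levels, n_features, dim, n_obs = 200, 10, 2, 500

# Synthetic side information: levels lie near a 2-dimensional subspace
# of a 10-dimensional feature space (an assumption made for illustration).
latent = rng.normal(size=(n_levels, dim))
loading = rng.normal(size=(dim, n_features))
level_features = latent @ loading + 0.01 * rng.normal(size=(n_levels, n_features))

coords = embed_levels(level_features, dim)

# Each observation carries one level; many levels appear only a few times.
obs_levels = rng.integers(0, n_levels, size=n_obs)
y = latent[obs_levels] @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=n_obs)

# One-hot design: one column per level (200 columns for 500 observations).
X_onehot = np.eye(n_levels)[obs_levels]

# Embedded design: an intercept plus `dim` coordinates per observation.
X_embed = np.column_stack([np.ones(n_obs), coords[obs_levels]])
beta, *_ = np.linalg.lstsq(X_embed, y, rcond=None)
```

With the one-hot design, each rare level gets its own coefficient estimated from a handful of observations; with the embedded design, every observation contributes to the same three coefficients.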

categorical variable, coefficient, zip code


Country:
- North America > United States
  - Alaska (0.04)
  - Maine (0.04)
  - Hawaii (0.04)
  - District of Columbia > Washington (0.04)
  - Florida
    - Monroe County > Key West (0.04)
    - Miami-Dade County > Miami (0.04)
  - New York
    - Tompkins County > Ithaca (0.04)
    - Erie County > Buffalo (0.04)
    - New York County
      - New York City (0.04)
      - Manhattan (0.04)
  - Illinois > Cook County
    - Chicago (0.04)
  - New Jersey > Mercer County
    - Princeton (0.04)
  - California
    - San Francisco County > San Francisco (0.14)
    - Ventura County > Oxnard (0.04)
    - Los Angeles County > Los Angeles (0.04)
    - Alameda County > Berkeley (0.04)
- Asia
  - Singapore (0.04)
  - Vietnam > Long An Province (0.04)

Genre:
- Research Report > New Finding (1.00)

Industry:
- Education (1.00)
- Health & Medicine > Therapeutic Area (0.68)

Technology:
- Information Technology
  - Data Science (1.00)
  - Artificial Intelligence > Machine Learning
    - Statistical Learning > Regression (0.68)
