How should we proxy for race/ethnicity? Comparing Bayesian improved surname geocoding to machine learning methods
–arXiv.org Artificial Intelligence
Political science research often requires constructing a race/ethnicity proxy variable for datasets that do not contain one, such as voter registration files, lists of electoral candidates, or political donation records. Constructing such a proxy is an important step in ecological inference for voting rights litigation (Barreto et al. [2019], Imai and Khanna [2016]), redistricting (DeLuca and Curiel [2022], Kenny et al. [2021]), and substantive research on the role of race/ethnicity in politics (Enos [2016], Enos et al. [2019], Grumbach and Sahn [2020]). The most common method for proxying race/ethnicity is Bayesian Improved Surname Geocoding (BISG), which uses Bayes' rule to compute a probability distribution over race/ethnicity categories conditional on a voter's surname and where they live (Elliott et al. [2008, 2009]). BISG has attained widespread popularity due to its parsimony, computational efficiency, and superior performance compared to existing alternatives, namely spatial interpolation of Census racial-ethnic composition from Census geographies (Imai and Khanna [2016], Clark et al. [2021], Shah and Davis [2017]). Although BISG performs well against this small suite of alternatives, it has not yet been benchmarked against machine learning (ML) models, which are more flexible and potentially more accurate predictors of race/ethnicity. In this paper I present the results of such a benchmark. I train a range of machine learning models on voter registration data from Florida, Georgia, North Carolina, and a portion of California, where voters self-report their race/ethnicity upon registration. The registries in these states contain over 26 million labelled observations, a non-representative sample of more than five percent of the United States electorate. I then compare BISG against out-of-state predictions from these models.
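The Bayes' rule update that BISG relies on can be illustrated with a minimal sketch. It assumes the standard BISG factorization, in which surname and geography are treated as conditionally independent given race/ethnicity, so the posterior is proportional to P(race | surname) × P(geography | race). The function name, category labels, and all probabilities below are hypothetical values chosen for illustration; they are not taken from the paper or from Census data.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Combine a surname-based prior with a geography likelihood via Bayes' rule.

    Both arguments are dicts keyed by the same race/ethnicity categories.
    Assumes surname and geography are independent conditional on race/ethnicity.
    """
    if p_race_given_surname.keys() != p_geo_given_race.keys():
        raise ValueError("prior and likelihood must share the same categories")

    # Unnormalized posterior: P(race | surname) * P(geography | race)
    unnormalized = {
        race: p_race_given_surname[race] * p_geo_given_race[race]
        for race in p_race_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("posterior is undefined: all products are zero")

    # Normalize so the probabilities sum to one
    return {race: value / total for race, value in unnormalized.items()}


# Hypothetical inputs (illustrative only, not real Census figures)
prior = {"white": 0.60, "black": 0.30, "hispanic": 0.05, "other": 0.05}
likelihood = {"white": 0.001, "black": 0.004, "hispanic": 0.002, "other": 0.001}

posterior = bisg_posterior(prior, likelihood)
most_likely = max(posterior, key=posterior.get)
```

In this toy case the surname alone favors one category, but the geography term shifts most of the posterior mass to another, which is the kind of correction that location information is meant to provide.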
Aug-1-2022
- Country:
  - Oceania > Australia
  - North America
    - Cuba (0.04)
    - United States
      - Georgia (0.55)
      - North Carolina (0.25)
      - Texas (0.04)
      - New Jersey (0.04)
      - Hawaii (0.04)
      - New York
        - Tompkins County > Ithaca (0.04)
        - New York County > New York City (0.04)
      - California > Los Angeles County
        - Los Angeles (0.04)
- Genre:
  - Research Report > New Finding (0.68)
- Industry:
  - Government > Voting & Elections (1.00)