Goto

Collaborating Authors

 bisg


LSTM+Geo with xgBoost Filtering: A Novel Approach for Race and Ethnicity Imputation with Reduced Bias

arXiv.org Artificial Intelligence

Accurate imputation of race and ethnicity (R&E) is crucial for analyzing disparities and informing policy. Methods like Bayesian Improved Surname Geocoding (BISG) are widely used but exhibit limitations, including systematic misclassification biases linked to socioeconomic status. This paper introduces LSTM+Geo, a novel approach enhancing Long Short-Term Memory (LSTM) networks with census tract geolocation information. Using a large voter dataset, we demonstrate that LSTM+Geo (88.7% accuracy) significantly outperforms standalone LSTM (86.4%) and Bayesian methods like BISG (82.9%) and BIFSG (86.8%) in accuracy and F1-score on a held-out validation set. LSTM+Geo reduces the rate at which non-White individuals are misclassified as White (White FPR 19.3%) compared to name-only LSTMs (White FPR 24.6%). While sophisticated ensemble methods incorporating XGBoost achieve the highest overall accuracy (up to 89.4%) and lowest White FPR (17.8%), LSTM+Geo offers strong standalone performance with improved bias characteristics compared to baseline models. Integrating LSTM+Geo into an XGBoost ensemble further boosts accuracy, highlighting its utility as both a standalone model and a component for advanced systems. We give a caution at the end regarding the appropriate use of these methods.


Can We Trust Race Prediction?

arXiv.org Artificial Intelligence

In this paper, I train a Bidirectional Long Short-Term Memory (BiLSTM) model on a novel dataset of voter registration data from all 50 US states and create an ensemble that achieves up to 36.8% higher out of sample (OOS) F1 scores than the best performing machine learning models in the literature. Additionally, I construct the most comprehensive database of first and surname distributions in the US in order to improve the coverage and accuracy of Bayesian Improved Surname Geocoding (BISG) and Bayesian Improved Firstname Surname Geocoding (BIFSG). Finally, I provide the first high-quality benchmark dataset in order to fairly compare existing models and aid future model developers.


How should we proxy for race/ethnicity? Comparing Bayesian improved surname geocoding to machine learning methods

arXiv.org Artificial Intelligence

Political science research often requires constructing a race/ethnicity proxy variable for datasets that do not contain it, like voter registration files, lists of electoral candidates, or political donation records. Constructing such a proxy is an important step for conducting ecological inference in voting rights litigation (Barreto et al. [2019], Imai and Khanna [2016]), redistricting (DeLuca and Curiel [2022], Kenny et al. [2021]), and substantive research on the role of race/ethnicity in politics (Enos [2016], Enos et al. [2019], Grumbach and Sahn [2020]). The most common method for proxying race/ethnicity is Bayesian Improved Surname Geocoding (BISG), which uses Bayes' rule to compute a probability distribution over race/ethnicity categories conditional on a voter's surname and where they live (Elliott et al. [2008, 2009]). BISG has attained widespread popularity due to its parsimony, computational efficiency, and superior performance when compared to existing alternatives, namely spatial interpolation of Census racial-ethnic composition from Census geographies (Imai and Khanna [2016], Clark et al. [2021], Shah and Davis [2017]). While BISG performs well compared to the small suite of existing alternatives, it has not yet been benchmarked against machine learning (ML) models, which can produce race/ethnicity predictions from more flexible and potentially more accurate models. In this paper I present the results of such a benchmark. I train a range of machine learning models using voter registration data from Florida, Georgia, North Carolina, and a portion of California where voters self-report their race/ethnicity upon registration. The registries in these states contain over 26 million labelled observations, which equates to greater than a five percent non-representative sample of the United States electorate. I then compare BISG against predictions from these models made out-of-state.