Textwash -- automated open-source text anonymisation
Bennett Kleinberg, Toby Davies, Maximilian Mozes
arXiv.org Artificial Intelligence
With the increasing digitisation of society and human communication, text data are becoming more important for research in the social and behavioural sciences (Gentzkow, Kelly, and Taddy 2019; Salganik 2019). Advances in natural language processing (NLP) in particular have led to exciting insights derived from text data, for example on emotional responses to the pandemic (Kleinberg, Vegt, and Mozes 2020) or on the rhetoric around immigration in political speeches (Card et al. 2022); for an overview, see Boyd and Schwartz (2021). Importantly, the use of computational techniques to quantify and analyse text data has created a demand for large datasets, often of several tens of thousands of documents, that can be harnessed for machine learning approaches (e.g., Socher et al. 2013; Lewis et al. 2020). This combination of a need for larger datasets and an appetite to use text data for the study of social science phenomena has resulted in a dilemma: many of the important questions require targeted, primary data collection or access to potentially sensitive data. Such data are hard to obtain, not because they do not exist but because sharing them is constrained by data protection regulations and ethical concerns. One potential consequence is that research activity may be biased toward topics for which suitable data are more readily available rather than toward those that are most important. One of the few viable solutions to this dilemma is automated text anonymisation, that is, the large-scale processing of text data so that individuals cannot be identified from the resulting output. Such a method would allow sensitive data to be shared so that the staggering potential of text data can be exploited for scientific progress. With this paper and the tool it introduces, we seek to enable researchers to work with such sensitive data in a way that protects the privacy of individuals whilst retaining the usefulness of anonymised data for computational text analysis.
Aug-27-2022