Textwash -- automated open-source text anonymisation

Kleinberg, Bennett; Davies, Toby; Mozes, Maximilian

Aug-27-2022–arXiv.org Artificial Intelligence

With the increasing digitisation of society and human communication, text data are becoming more important for research in the social and behavioural sciences (Gentzkow, Kelly, and Taddy 2019; Salganik 2019). Advances in natural language processing (NLP) in particular have led to exciting insights derived from text data, for example on emotional responses to the pandemic (Kleinberg, Vegt, and Mozes 2020) or on the rhetoric around immigration in political speeches (Card et al. 2022); for an overview, see Boyd and Schwartz (2021). Importantly, the use of computational techniques to quantify and analyse text data has created demand for large datasets in particular (often of several tens of thousands of documents) that can be harnessed for machine learning approaches (e.g., Socher et al. 2013; Lewis et al. 2020). This combination of a need for larger datasets and an appetite to use text data for the study of social science phenomena has resulted in a dilemma: many of the important questions require targeted, primary data collection or access to potentially sensitive data. However, such data are hard to obtain, not because they do not exist but because sharing them is constrained by data protection regulations and ethical concerns. One potential consequence is that research activity may be biased toward topics for which suitable data are more readily available rather than those that are most important. One of the few viable solutions to this dilemma is automated text anonymisation; that is, the large-scale processing of text data so that individuals cannot be identified from the resulting output. Such a method would allow sensitive data to flow so that the staggering potential of text data can be exploited for scientific progress. With this paper and the tool it introduces, we seek to enable researchers to work with such sensitive data in a way that protects the privacy of individuals whilst retaining the usefulness of anonymised data for computational text analysis.

information, machine learning, natural language

Country:
- North America > United States
  - Pennsylvania > Allegheny County
    - Pittsburgh (0.04)
  - New Jersey > Mercer County
    - Princeton (0.04)
  - Indiana > Saint Joseph County
    - Granger (0.04)
- Europe
  - Netherlands (0.14)
  - France (0.04)
  - Belgium (0.04)
  - United Kingdom > England
    - Greater London > London (0.04)
    - Oxfordshire (0.04)
  - Spain > Galicia
    - Madrid (0.04)
  - Russia > Northwestern Federal District
    - Leningrad Oblast > Saint Petersburg (0.04)
- Asia
  - Russia (0.04)
  - China > Hong Kong (0.04)
  - Middle East > Republic of Türkiye
    - Batman Province > Batman (0.04)

Genre:
- Research Report
  - New Finding (0.67)
  - Experimental Study (0.67)
- Instructional Material
  - Course Syllabus & Notes (0.64)
  - Online (0.50)

Industry:
- Information Technology > Security & Privacy (1.00)
- Government > Regional Government (1.00)
- Law (0.93)
- Health & Medicine > Therapeutic Area (0.89)
- Media
  - Music (1.00)
  - Film (1.00)
- Leisure & Entertainment > Sports
  - Motorsports > Formula One (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Text Processing (1.00)
  - Machine Learning (1.00)
