 wikipedia biography


Re-identification of De-identified Documents with Autoregressive Infilling

Charpentier, Lucas Georges Gabriel, Lison, Pierre

arXiv.org Artificial Intelligence

Documents revealing sensitive information about individuals must typically be de-identified. This de-identification is often done by masking all mentions of personally identifiable information (PII), thereby making it more difficult to uncover the identity of the person(s) in question. To investigate the robustness of de-identification methods, we present a novel, RAG-inspired approach that attempts the reverse process of re-identification based on a database of documents representing background knowledge. Given a text in which personal identifiers have been masked, the re-identification proceeds in two steps. A retriever first selects from the background knowledge passages deemed relevant for the re-identification. Those passages are then provided to an infilling model which seeks to infer the original content of each text span. This process is repeated until all masked spans are replaced. We evaluate the re-identification on three datasets (Wikipedia biographies, court rulings and clinical notes). Results show that (1) as many as 80% of de-identified text spans can be successfully recovered and (2) the re-identification accuracy increases along with the level of background knowledge.
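The two-step loop described in this abstract (retrieve background passages, infill one masked span, repeat) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `[MASK]` placeholder, the word-overlap retriever, and the names `retrieve`, `reidentify`, and `toy_infill` are all hypothetical, and `toy_infill` is a trivial stand-in for a real infilling model.

```python
import re
from typing import Callable, List

MASK = "[MASK]"  # assumed placeholder for a masked span (not from the paper)


def retrieve(query: str, passages: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank background passages by word overlap with the query."""
    query_words = set(re.findall(r"\w+", query.lower()))

    def overlap(passage: str) -> int:
        return len(query_words & set(re.findall(r"\w+", passage.lower())))

    return sorted(passages, key=overlap, reverse=True)[:k]


def reidentify(
    text: str,
    passages: List[str],
    infill: Callable[[str, List[str]], str],
    k: int = 2,
) -> str:
    """Fill masked spans one at a time, retrieving fresh context before each fill."""
    # Bound the loop by the initial number of masks so it always terminates.
    for _ in range(text.count(MASK)):
        context = retrieve(text, passages, k)
        guess = infill(text, context)
        text = text.replace(MASK, guess, 1)  # replace only the first remaining mask
    return text


def toy_infill(text: str, context: List[str]) -> str:
    """Hypothetical stand-in for an infilling model: return the first
    capitalized word from the retrieved passages that is not yet in the text."""
    for passage in context:
        for token in re.findall(r"[A-Z][a-z]+", passage):
            if token not in text:
                return token
    return "UNKNOWN"


background = [
    "Ada Lovelace wrote the first program for the Analytical Engine.",
    "The weather in Oslo is cold in winter.",
]
masked = "[MASK] [MASK] wrote the first program for the Analytical Engine."
print(reidentify(masked, background, toy_infill))
```

Re-running retrieval after every fill lets each recovered span sharpen the query for the next one, which is the motivation for making the process iterative rather than filling all spans in a single pass.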


Neural Text Sanitization with Privacy Risk Indicators: An Empirical Analysis

Papadopoulou, Anthi, Lison, Pierre, Anderson, Mark, Øvrelid, Lilja, Pilán, Ildikó

arXiv.org Artificial Intelligence

Text sanitization is the task of redacting a document to mask all occurrences of (direct or indirect) personal identifiers, with the goal of concealing the identity of the individual(s) referred to in it. In this paper, we consider a two-step approach to text sanitization and provide a detailed analysis of its empirical performance on two recently published datasets: the Text Anonymization Benchmark (Pilán et al., 2022) and a collection of Wikipedia biographies (Papadopoulou et al., 2022). The text sanitization process starts with a privacy-oriented entity recognizer that seeks to determine the text spans expressing identifiable personal information. This privacy-oriented entity recognizer is trained by combining a standard named entity recognition model with a gazetteer populated by person-related terms extracted from Wikidata. The second step of the text sanitization process consists in assessing the privacy risk associated with each detected text span, either isolated or in combination with other text spans. We present five distinct indicators of the re-identification risk, respectively based on language model probabilities, text span classification, sequence labelling, perturbations, and web search. We provide a contrastive analysis of each privacy indicator and highlight their benefits and limitations, notably in relation to the available labeled data.


Wikibio: a Semantic Resource for the Intersectional Analysis of Biographical Events

Stranisci, Marco Antonio, Damiano, Rossana, Mensa, Enrico, Patti, Viviana, Radicioni, Daniele, Caselli, Tommaso

arXiv.org Artificial Intelligence

Biographical event detection is a relevant task for the exploration and comparison of the ways in which people's lives are told and represented. In this sense, it may support several applications in digital humanities and in works aimed at exploring bias about minoritized groups. Despite that, there are no corpora and models specifically designed for this task. In this paper we fill this gap by presenting a new corpus annotated for biographical event detection. The corpus, which includes 20 Wikipedia biographies, was compared with five existing corpora to train a model for the biographical event detection task. The model was able to detect all mentions of the target-entity in a biography with an F-score of 0.808 and the entity-related events with an F-score of 0.859. Finally, the model was used for performing an analysis of biases about women and non-Western people in Wikipedia biographies.


Meta AI's open-source system attempts to right gender bias in Wikipedia biographies

#artificialintelligence

By this point, it's become reflexive: When searching for something on Google, Wikipedia is the de facto go-to first page. The website is consistently among the top 10 most-visited websites in the world. Yet, not all changemakers and historical figures are equally represented on the dominant web encyclopedia. Just 20% of Wikipedia biographies are about women.