Protecting De-identified Documents from Search-based Linkage Attacks
–arXiv.org Artificial Intelligence
While de-identification models can help conceal the identity of the individual(s) mentioned in a document, they fail to address linkage risks, defined as the potential to map the de-identified text back to its source. One straightforward way to perform such linkages is to extract phrases from the de-identified document and then check for their presence in the original dataset. This paper presents a method to counter search-based linkage attacks while preserving the semantic integrity of the text. The method proceeds in two steps. We first construct an inverted index of the N-grams occurring in the document collection, making it possible to efficiently determine which N-grams appear in fewer than $k$ documents (either alone or in combination with other N-grams). An LLM-based rewriter is then iteratively queried to reformulate those spans until linkage is no longer possible. Experimental results on a collection of court cases show that the method effectively prevents search-based linkages while remaining faithful to the original content.
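The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the function names (`text_ngrams`, `build_inverted_index`, `risky_ngrams`, `risky_pairs`, `protect`), the `rewrite_span` callable standing in for the LLM-based rewriter, and the toy documents are all assumptions made for illustration. The sketch also assumes that an N-gram is a linkage risk when it matches at least one but fewer than $k$ documents.

```python
from collections import defaultdict
from itertools import combinations


def text_ngrams(text, max_n=3):
    """Return the distinct word n-grams (n = 1..max_n) of a text, in order."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    grams = [tuple(tokens[i:i + n])
             for n in range(1, max_n + 1)
             for i in range(len(tokens) - n + 1)]
    return list(dict.fromkeys(grams))  # de-duplicate, keep first-seen order


def build_inverted_index(docs, max_n=3):
    """Map each n-gram to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in text_ngrams(text, max_n):
            index[gram].add(doc_id)
    return index


def risky_ngrams(text, index, k, max_n=3):
    """N-grams of `text` found in at least one but fewer than k documents."""
    return [g for g in text_ngrams(text, max_n)
            if 0 < len(index.get(g, ())) < k]


def risky_pairs(text, index, k, max_n=1):
    """Pairs of individually common n-grams that co-occur in fewer than k docs."""
    common = [g for g in text_ngrams(text, max_n)
              if len(index.get(g, ())) >= k]
    return [(a, b) for a, b in combinations(common, 2)
            if 0 < len(index[a] & index[b]) < k]


def protect(text, index, k, rewrite_span, max_n=3, max_rounds=5):
    """Rewrite risky single n-grams until none remain or max_rounds is hit.

    `rewrite_span` is a hypothetical stand-in for the LLM-based rewriter.
    Plain substring replacement is a simplification of span editing, and
    risky combinations (see risky_pairs) are not handled here.
    """
    current = " ".join(text.lower().split())
    for _ in range(max_rounds):
        risky = risky_ngrams(current, index, k, max_n)
        if not risky:
            break
        for gram in risky:
            span = " ".join(gram)
            current = current.replace(span, rewrite_span(span))
    return current


# Toy collection, invented for illustration only.
docs = {
    "d1": "the applicant was born in oslo",
    "d2": "the applicant was born in lublin",
    "d3": "the applicant lives in oslo",
}
index = build_inverted_index(docs)
```

With $k = 2$, the trigram "born in oslo" occurs only in `d1`, so `risky_ngrams(docs["d1"], index, k=2)` flags it. The words "was" and "oslo" each occur in two documents but co-occur only in `d1`, which is the kind of combination `risky_pairs` is meant to surface.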
Oct-9-2025
- Country:
  - Asia
    - China > Hong Kong (0.04)
    - Middle East > Israel
      - Central District > Ramla (0.04)
    - Russia > Siberian Federal District
      - Novosibirsk Oblast > Novosibirsk (0.04)
  - Europe
    - Belgium > Brussels-Capital Region
      - Brussels (0.04)
    - Germany > Bavaria
      - Upper Bavaria > Munich (0.04)
    - Norway > Eastern Norway
      - Oslo (0.04)
    - Poland
      - Lublin Province > Lublin (0.08)
      - Masovia Province > Warsaw (0.04)
      - Opole Province > Opole (0.06)
    - Russia > Central Federal District
      - Moscow Oblast > Moscow (0.04)
    - Spain (0.04)
    - United Kingdom > England
      - Cambridgeshire > Cambridge (0.14)
  - North America
    - Canada > Ontario
      - Toronto (0.04)
    - Dominican Republic (0.04)
    - Montserrat (0.04)
    - United States (0.05)
  - South America > Chile
- Genre:
  - Research Report > Experimental Study (0.82)
- Industry:
  - Government (0.95)
  - Information Technology > Security & Privacy (1.00)
  - Law > Criminal Law (0.69)
  - Law Enforcement & Public Safety
    - Corrections (1.00)
    - Crime Prevention & Enforcement (1.00)
- Technology: