Protecting De-identified Documents from Search-based Linkage Attacks

Oct-9-2025–arXiv.org Artificial Intelligence

While de-identification models can help conceal the identity of the individual(s) mentioned in a document, they fail to address linkage risks, defined as the potential to map the de-identified text back to its source. One straightforward way to perform such a linkage is to extract phrases from the de-identified document and check whether they occur in the original dataset. This paper presents a method to counter search-based linkage attacks while preserving the semantic integrity of the text. The method proceeds in two steps. We first construct an inverted index of the N-grams occurring in the document collection, making it possible to efficiently determine which N-grams appear in fewer than $k$ documents (either alone or in combination with other N-grams). An LLM-based rewriter is then iteratively queried to reformulate those spans until linkage is no longer possible. Experimental results on a collection of court cases show that the method effectively prevents search-based linkages while remaining faithful to the original content.
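The first step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`build_inverted_index`, `risky_ngrams`, `risky_pairs`), the whitespace tokenization, the toy documents, and the restriction of "in combination" to pairs of N-grams are all assumptions made for illustration. The sketch also assumes that an N-gram absent from the collection cannot be used for linkage, so only counts between 1 and $k-1$ are flagged. The LLM-based rewriting loop is not shown.

```python
from collections import defaultdict
from itertools import combinations


def ngrams(tokens, n):
    """Return the set of word n-grams (as tuples) in a list of tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_inverted_index(docs, n):
    """Map each n-gram to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: lowercase whitespace tokenization is good enough here.
        for gram in ngrams(text.lower().split(), n):
            index[gram].add(doc_id)
    return index


def risky_ngrams(text, index, n, k):
    """N-grams of `text` found in at least one but fewer than k documents."""
    grams = ngrams(text.lower().split(), n)
    return {g for g in grams if 0 < len(index.get(g, ())) < k}


def risky_pairs(text, index, n, k):
    """Pairs of n-grams of `text` that co-occur in at least one but fewer
    than k documents (assumption: combinations are limited to pairs)."""
    grams = sorted(g for g in ngrams(text.lower().split(), n) if g in index)
    risky = set()
    for a, b in combinations(grams, 2):
        shared = index[a] & index[b]  # documents containing both n-grams
        if 0 < len(shared) < k:
            risky.add((a, b))
    return risky


# Hypothetical toy collection; real inputs would be the original documents.
docs = {
    "a": "red car blue sky",
    "b": "red car green sea",
    "c": "old man blue sky",
}
index = build_inverted_index(docs, n=2)
deidentified = "red car and blue sky"

single = risky_ngrams(deidentified, index, n=2, k=2)
pairs = risky_pairs(deidentified, index, n=2, k=2)
```

In this toy case, each of the bigrams `("red", "car")` and `("blue", "sky")` occurs in two documents, so neither is flagged on its own for $k = 2$. Together they occur only in document `a`, so the pair is flagged. This is the situation the abstract refers to with "in combination with other N-grams".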

large language model, machine learning, natural language

Country:
- North America (0.68)
- Europe
  - United Kingdom > England (0.28)
  - Poland > Lublin Province (0.19)

Genre:
- Research Report > Experimental Study (0.82)

Industry:
- Information Technology > Security & Privacy (1.00)
- Government (0.95)
- Law > Criminal Law (0.69)
- Law Enforcement & Public Safety
  - Crime Prevention & Enforcement (1.00)
  - Corrections (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (1.00)
  - Representation & Reasoning > Search (0.82)
  - Natural Language
    - Text Processing (0.94)
    - Large Language Model (0.69)
