ClustEm4Ano: Clustering Text Embeddings of Nominal Textual Attributes for Microdata Anonymization

Aufschläger, Robert, Wilhelm, Sebastian, Heigl, Michael, Schramm, Martin

Dec-17-2024–arXiv.org Artificial Intelligence

This work introduces ClustEm4Ano, an anonymization pipeline for generalization- and suppression-based anonymization of nominal textual tabular data. It automatically generates value generalization hierarchies (VGHs), which in turn can be used to generalize the attributes in quasi-identifiers. The pipeline uses text embeddings to produce semantically close value generalizations through iterative clustering. We applied KMeans and Hierarchical Agglomerative Clustering to $13$ different predefined text embeddings, both open-source and closed-source (accessed via APIs). We test the approach experimentally on a well-known benchmark dataset for anonymization, the UCI Machine Learning Repository's Adult dataset. ClustEm4Ano supports anonymization procedures by offering more options than arbitrarily chosen VGHs. Experiments demonstrate that the generated VGHs can outperform manually constructed ones in terms of downstream efficacy, especially for small $k$ in $k$-anonymity ($2 \leq k \leq 30$), and can therefore improve the quality of anonymized datasets. Our implementation is publicly available.
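The idea of deriving a VGH by iteratively clustering value embeddings can be illustrated with a minimal sketch. This is not the authors' implementation: the 2-D vectors in `EMBEDDINGS` are made-up stand-ins for real text embeddings, the helper names (`build_vgh`, `generalize`, `is_k_anonymous`) are hypothetical, and a simple centroid-linkage agglomerative merge replaces the KMeans and Hierarchical Agglomerative Clustering used in the paper.

```python
from collections import Counter
from itertools import combinations
from math import dist

# Hypothetical 2-D stand-ins for text embeddings of an "education" attribute.
# Real text embeddings would have hundreds of dimensions.
EMBEDDINGS = {
    "Bachelors": (0.90, 0.80),
    "Masters": (1.00, 0.95),
    "Doctorate": (1.15, 1.05),
    "HS-grad": (0.20, 0.10),
    "11th": (0.05, 0.00),
}


def centroid(members, embeddings):
    """Mean embedding of all values in a cluster."""
    points = [embeddings[m] for m in members]
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def build_vgh(embeddings, cluster_counts):
    """Return one partition per hierarchy level.

    Level 0 holds the original values; each further level merges the
    closest clusters (by centroid distance) until the requested number
    of clusters for that level is reached.
    """
    clusters = [frozenset([v]) for v in embeddings]
    levels = [list(clusters)]
    for target in cluster_counts:
        while len(clusters) > target:
            a, b = min(
                combinations(clusters, 2),
                key=lambda pair: dist(
                    centroid(pair[0], embeddings), centroid(pair[1], embeddings)
                ),
            )
            clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        levels.append(list(clusters))
    return levels


def generalize(value, level):
    """Replace a value by a label naming every member of its cluster."""
    cluster = next(c for c in level if value in c)
    return "{" + ", ".join(sorted(cluster)) + "}"


def is_k_anonymous(quasi_identifier_values, k):
    """True if every distinct quasi-identifier value occurs at least k times."""
    return min(Counter(quasi_identifier_values).values()) >= k


# Three levels: original values, two clusters, one cluster (full suppression).
levels = build_vgh(EMBEDDINGS, [2, 1])

# A toy quasi-identifier column and its generalization at level 1.
column = ["Bachelors", "Masters", "Doctorate", "HS-grad", "11th", "HS-grad"]
generalized = [generalize(v, levels[1]) for v in column]
```

In this toy example the raw column is not 2-anonymous because several values occur only once, while the level-1 generalization groups the records into two equivalence classes of three records each. The set-style labels are only a placeholder; how generalized values are named is left open here.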

artificial intelligence, machine learning, natural language

Country:
- North America > United States
  - New York > New York County
    - New York City (0.04)
  - Minnesota > Hennepin County
    - Minneapolis (0.14)
- Europe
  - Germany (0.14)
  - Middle East > Malta
    - Port Region > Southern Harbour District > Valletta (0.04)
  - Italy > Veneto
    - Venice (0.04)
  - Croatia > Dubrovnik-Neretva County
    - Dubrovnik (0.04)
- Asia > Middle East
  - Qatar > Ad-Dawhah > Doha (0.04)

Genre:
- Research Report > New Finding (0.68)

Industry:
- Information Technology > Security & Privacy (1.00)

Technology:
- Information Technology
  - Security & Privacy (1.00)
  - Data Science (1.00)
  - Artificial Intelligence
    - Representation & Reasoning (1.00)
    - Natural Language (1.00)
    - Machine Learning
      - Statistical Learning > Clustering (1.00)
      - Neural Networks (0.94)