Scalable multilingual PII annotation for responsible AI in LLMs
Bharti Meena, Joanna Skubisz, Harshit Rajgarhia, Nand Dave, Kiran Ganesh, Shivali Dalmia, Abhishek Mukherji, Vasudevan Sundarababu
arXiv.org Artificial Intelligence
Abstract: As Large Language Models (LLMs) gain wider adoption, ensuring that they handle Personally Identifiable Information (PII) reliably across diverse regulatory contexts has become essential. This work introduces a scalable multilingual data curation framework designed for high-quality PII annotation across 13 underrepresented locales (Table I), covering approximately 336 locale-specific PII types. Our phased, human-in-the-loop annotation methodology combines linguistic expertise with rigorous quality assurance, yielding substantial improvements in recall and false positive rates across the pilot, training, and production phases. Beyond reporting empirical gains, we highlight common annotator challenges in multilingual PII labeling and demonstrate how iterative, analytics-driven pipelines can enhance both annotation quality and downstream model reliability.

I. Introduction

A. PII Data Protection

The surge in user-generated content has produced vast textual corpora containing hidden instances of Personally Identifiable Information (PII) in application forms, support tickets, reviews, and social media posts [1]. PII--such as NAME, SSN, and PHONE NUMBER--poses significant privacy risks if not handled correctly. Its compromise can result in identity theft, financial fraud, and unauthorized access to sensitive data [2].
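The abstract tracks annotation quality in terms of span-level recall and false positive rate across the pilot, training, and production phases. The following is a minimal sketch, not the paper's evaluation code, of how such metrics could be computed over annotated PII spans; the span representation, function name, and false-positive definition are assumptions introduced here for illustration.

# Minimal illustrative sketch (assumed, not the paper's code): span-level
# recall and false-positive rate for PII annotations.
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, PII type)

def recall_and_fpr(gold: List[Span], predicted: List[Span]) -> Tuple[float, float]:
    # A predicted span counts as correct only if its offsets and PII type
    # exactly match a gold span.
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    recall = true_pos / len(gold_set) if gold_set else 1.0
    # Here the false-positive rate is the share of predicted spans with no
    # gold match; the paper may define the denominator differently.
    fpr = (len(pred_set) - true_pos) / len(pred_set) if pred_set else 0.0
    return recall, fpr

# Hypothetical example: one missed SSN and one spurious NAME annotation.
gold = [(0, 11, "SSN"), (20, 32, "PHONE_NUMBER")]
pred = [(20, 32, "PHONE_NUMBER"), (40, 45, "NAME")]
print(recall_and_fpr(gold, pred))  # (0.5, 0.5)

Under exact-match scoring of this kind, the improvements the paper reports from pilot to production correspond to annotators recovering a larger share of gold spans (higher recall) while marking fewer spurious ones (lower false positive rate).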
Oct-13-2025