An Evaluation Study of Hybrid Methods for Multilingual PII Detection

Rajgarhia, Harshit; Gupta, Suryam; Shaik, Asif; Kumar, Gulipalli Praveen; Santhoshraj, Y; Nishitha, Sanka Nithya Tanvy; Mukherji, Abhishek

Oct-10-2025, arXiv.org Artificial Intelligence

The detection of Personally Identifiable Information (PII) is critical for privacy compliance but remains challenging in low-resource languages due to linguistic diversity and limited annotated data. We present RECAP, a hybrid framework that combines deterministic regular expressions with context-aware large language models (LLMs) for scalable PII detection across 13 low-resource locales. RECAP's modular design supports over 300 entity types without retraining, using a three-phase refinement pipeline for disambiguation and filtering. Benchmarked with nervaluate, our system outperforms fine-tuned NER models by 82% and zero-shot LLMs by 17% in weighted F1-score. This work offers a scalable and adaptable solution for efficient PII detection in compliance-focused applications.
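The hybrid idea in the abstract, deterministic regular expressions proposing candidates and a context-aware model confirming or rejecting them, can be illustrated with a minimal sketch. This is not RECAP's implementation: the `PATTERNS` table, the `Candidate` type, and the `keyword_validator` function are hypothetical names introduced here, the two patterns are simplified examples, and the keyword check is only a stand-in for the LLM call. The abstract does not detail the three refinement phases, so the sketch shows a single validation and filtering step.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Candidate:
    entity_type: str
    text: str
    start: int
    end: int


# Hypothetical, simplified patterns for illustration; not taken from RECAP.
PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def regex_candidates(text: str) -> List[Candidate]:
    """Deterministic step: collect every pattern match as a candidate span."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append(Candidate(entity_type, m.group(), m.start(), m.end()))
    return sorted(found, key=lambda c: c.start)


def keyword_validator(text: str, cand: Candidate, window: int = 30) -> bool:
    """Assumed stand-in for the context-aware LLM check.

    Emails are accepted as-is; a digit run is kept as a phone number only
    if a phone-related cue appears in the characters just before it.
    """
    if cand.entity_type != "PHONE":
        return True
    context = text[max(0, cand.start - window):cand.start].lower()
    return any(cue in context for cue in ("phone", "call", "tel", "mobile"))


def detect_pii(
    text: str,
    validator: Callable[[str, Candidate], bool] = keyword_validator,
) -> List[Candidate]:
    """Filtering step: keep only the candidates the validator confirms."""
    return [c for c in regex_candidates(text) if validator(text, c)]
```

Calling `detect_pii("Call me at +44 20 7946 0958 or mail a.b@example.org. Order id 1234567890123.")` keeps the phone number and the email address and drops the order number, which the regular expression alone would have flagged. Passing a different `validator` is where a real model call would plug in, and adding an entry to `PATTERNS` adds an entity type without retraining anything, which is the kind of modularity the abstract describes.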

large language model, machine learning, natural language


