DataSIR: ABenchmark Dataset for Sensitive Information Recognition
–Neural Information Processing Systems
With the rapid development of artificial intelligence technologies, the demand for training data has surged, exacerbating risks of data leakage. Despite increasing incidents and costs associated with such leaks, data leakage prevention (DLP) technologies lag behind evolving evasion techniques that bypass existing sensitive information recognition (SIR) models. Current datasets lack comprehensive coverage of these adversarial transformations, limiting the evaluation of robust SIR systems. To address this gap, we introduce DataSIR, a benchmark dataset specifically designed to evaluate SIR models on sensitive data subjected to diverse format transformations. We curate 26 sensitive data categories based on multiple international regulations, and collect 131,890 original samples correspondingly.
Neural Information Processing Systems
Jun-16-2026, 09:13:35 GMT
- Country:
- North America > United States (0.93)
- Genre:
- Research Report > Experimental Study (1.00)
- Industry:
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- Technology: