High-Power Training Data Identification with Provable Statistical Guarantees
Zhenlong Liu, Hao Zeng, Weiran Huang, Hongxin Wei
arXiv.org Artificial Intelligence
Conventional approaches treat training data identification as a simple binary classification task without statistical guarantees. A recent approach is designed to control the false discovery rate (FDR), but its guarantees rely on strong, easily violated assumptions. In this paper, we introduce Provable Training Data Identification (PTDI), a rigorous method that identifies a set of training data with strict FDR control. Specifically, our method computes a p-value for each data point using a set of known unseen data, then constructs a conservative estimator of the data usage proportion of the test set, which allows us to scale these p-values. It then selects the final set of training data by identifying all points whose scaled p-values fall below a data-dependent threshold. This procedure enables the discovery of training data with provable, strict FDR control and significantly boosted power. Extensive experiments across a wide range of models (LLMs and VLMs) and datasets demonstrate that PTDI strictly controls the FDR and achieves higher power.

These concerns underscore the importance of identifying a specific, well-defined set of data allegedly used in training. To resolve such high-stakes disputes, claims must be supported by credible evidence that strictly controls the risk of false positives. This calls for methods that provide rigorous statistical guarantees for identifying training data.
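The three steps named in the abstract (p-values from known unseen data, a conservative proportion estimate used to scale them, and a data-dependent threshold) can be illustrated with a generic sketch. This is not the authors' PTDI implementation: it assumes conformal p-values computed against a calibration set of unseen data, swaps in a Storey-style estimate of the non-member proportion in place of the paper's estimator, and uses a Benjamini-Hochberg step-up rule as the threshold. The function names, the `lam` tuning parameter, and the convention that a higher score means "more likely a training member" are all assumptions made for illustration, and no FDR guarantee is claimed for this sketch.

```python
import numpy as np


def conformal_p_values(test_scores, calib_scores):
    """p_j = (1 + #{calibration scores >= test score j}) / (n + 1).

    Assumption: a higher score means the point looks more like training data,
    and calib_scores come from data known to be unseen by the model.
    """
    calib = np.sort(np.asarray(calib_scores, dtype=float))
    test = np.asarray(test_scores, dtype=float)
    n = calib.size
    # searchsorted(side="left") counts calibration scores strictly below each test score.
    n_greater_equal = n - np.searchsorted(calib, test, side="left")
    return (1.0 + n_greater_equal) / (n + 1.0)


def storey_null_proportion(p_values, lam=0.5):
    """Storey-style estimate of the fraction of unseen (null) points, capped at 1.

    One minus this value is a rough stand-in for the data usage proportion;
    it is not the estimator from the paper.
    """
    m = p_values.size
    pi0 = (1.0 + np.sum(p_values > lam)) / (m * (1.0 - lam))
    return min(1.0, float(pi0))


def select_training_data(test_scores, calib_scores, alpha=0.1, lam=0.5):
    """Return sorted indices of test points flagged as training data."""
    p = conformal_p_values(test_scores, calib_scores)
    pi0 = storey_null_proportion(p, lam)
    scaled = np.minimum(1.0, pi0 * p)  # scale p-values by the estimated null proportion

    m = scaled.size
    order = np.argsort(scaled)
    # Benjamini-Hochberg step-up: largest k with p_(k) <= alpha * k / m.
    thresholds = alpha * np.arange(1, m + 1) / m
    passing = np.nonzero(scaled[order] <= thresholds)[0]
    if passing.size == 0:
        return np.array([], dtype=int)
    k = passing[-1] + 1
    return np.sort(order[:k])
```

A call such as `select_training_data(scores_on_candidates, scores_on_known_unseen, alpha=0.1)` returns the indices of the candidate points whose scaled p-values pass the step-up threshold at target level `alpha`. The scoring function that produces the inputs (for example a loss-based membership score) is left to the caller.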
Oct-14-2025
- Country:
  - Asia
  - Europe
  - North America
    - Canada > Ontario
      - Toronto (0.04)
    - Mexico > Mexico City
      - Mexico City (0.04)
    - United States
      - California (0.14)
      - Florida > Miami-Dade County
        - Miami (0.04)
- Genre:
  - Research Report > Experimental Study (0.77)
- Industry:
  - Government > Regional Government (0.67)
  - Information Technology > Security & Privacy (1.00)
  - Law (1.00)
- Technology: