LARP: Learner-Agnostic Robust Data Prefiltering
Minchev, Kristian, Dimitrov, Dimitar Iliev, Konstantinov, Nikola
The widespread availability of large public datasets is a key factor behind the recent successes of statistical inference and machine learning methods. However, these datasets often contain some low-quality or contaminated data, to which many learning procedures are sensitive. Therefore, the question of whether and how public datasets should be prefiltered to facilitate accurate downstream learning arises. On a technical level this requires the construction of principled data prefiltering methods which are learner-agnostic robust, in the sense of provably protecting a set of pre-specified downstream learners from corrupted data. In this work, we formalize the problem of Learner-Agnostic Robust data Prefiltering (LARP), which aims at finding prefiltering procedures that minimize a worst-case loss over a pre-specified set of learners. We first instantiate our framework in the context of scalar mean estimation with Huber estimators under the Huber data contamination model. We provide a hardness result on a specific problem instance and analyze several natural prefiltering procedures. Our theoretical results indicate that performing LARP on a heterogeneous set of learners leads to some loss in model performance compared to the alternative of prefiltering data for each learner/use-case individually. We explore the resulting utility loss and its dependence on the problem parameters via extensive experiments on real-world image and tabular data, observing statistically significant reduction in utility. Finally, we model the trade-off between the utility drop and the cost of repeated (learner-specific) prefiltering within a game-theoretic framework and showcase benefits of LARP for large datasets.
Jul-11-2025
- Country:
- Asia > Middle East
- Jordan (0.04)
- Europe
- Bulgaria > Sofia City Province
- Sofia (0.04)
- Switzerland > Zürich
- Zürich (0.04)
- United Kingdom > England
- Cambridgeshire > Cambridge (0.04)
- Bulgaria > Sofia City Province
- North America
- Canada > Ontario
- Toronto (0.14)
- United States (0.14)
- Canada > Ontario
- Asia > Middle East
- Genre:
- Research Report > New Finding (1.00)
- Industry:
- Health & Medicine (0.93)
- Information Technology (1.00)
- Technology: