Step-E: A Differentiable Data Cleaning Framework for Robust Learning with Noisy Labels

Du, Wenzhang

arXiv.org Artificial Intelligence 

Modern deep networks achieve impressive performance when trained on large, clean, and carefully curated datasets. In realistic data mining scenarios, however, labels come from heterogeneous sources such as crowdsourcing, weak supervision, or heuristic rules, and are therefore often noisy [18, 3]. Human annotation errors, ambiguous images, and domain shifts all contribute to mislabeled or outlier samples that harm generalization. In image classification, for example, web-scale datasets often contain wrong tags or near-duplicate images with conflicting labels; in user-generated content analysis, spam or off-topic posts corrupt the training distribution.

Data cleaning is widely recognized as crucial [15] but is typically performed before model training, using hand-crafted rules or separate anomaly detectors [9, 16]. This two-stage design has two drawbacks: (i) it requires domain expertise or extra supervision to specify cleaning rules and thresholds; (ii) it decouples cleaning from model optimization, so cleaning decisions cannot leverage discriminative feedback from the task model. Such feedback matters because loss alone is ambiguous: some high-loss samples are informative "hard cases," whereas others are truly corrupted and should be discarded. We therefore explore a different paradigm: can the model learn which samples to trust during training, treating data cleaning as an integral, differentiable part of optimization?
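To make the paradigm concrete, the following is a minimal sketch of differentiable sample trust, not the actual Step-E algorithm: each sample gets a learnable trust logit s_i, the model minimizes a trust-weighted loss plus a penalty λ for distrusting samples, and both the model parameters and the trust logits are updated by gradient descent. The logistic-regression task, the specific loss form, λ, and all hyperparameters here are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    # Clip for numerical stability before exponentiating.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def train_with_learned_trust(X, y, lam=1.0, lr=0.5, steps=2000):
    """Jointly learn model weights w and per-sample trust u_i = sigmoid(s_i).

    Objective (illustrative): mean_i [ u_i * ell_i ] + lam * mean_i [ 1 - u_i ],
    where ell_i is the per-sample cross-entropy. The lam term penalizes
    discarding samples, so only samples with loss above ~lam lose trust.
    """
    n, d = X.shape
    w = np.zeros(d)      # logistic-regression weights
    s = np.zeros(n)      # trust logits, initialized so u_i = 0.5
    eps = 1e-9
    for _ in range(steps):
        p = sigmoid(X @ w)
        ell = -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
        u = sigmoid(s)
        # Gradient of the weighted loss w.r.t. w (trusted samples dominate).
        grad_w = X.T @ (u * (p - y)) / n
        # Per-sample trust gradient: trust falls when ell_i > lam, rises otherwise.
        grad_s = (ell - lam) * u * (1.0 - u)
        w -= lr * grad_w
        s -= lr * grad_s
    return w, sigmoid(s)

# Toy data: two well-separated Gaussian classes, with the first 5
# labels of class 0 flipped to simulate label noise.
rng = np.random.default_rng(0)
n_per = 30
X = np.vstack([rng.normal(-2.0, 1.0, size=(n_per, 2)),
               rng.normal(+2.0, 1.0, size=(n_per, 2))])
y = np.concatenate([np.zeros(n_per), np.ones(n_per)])
y[:5] = 1.0  # injected label noise

w, trust = train_with_learned_trust(X, y)
```

In this sketch the flipped samples accumulate high loss once the model fits the clean majority, so their trust is driven toward zero and they stop influencing the weight updates, while clean samples retain trust near one. A real system would replace the logistic model with a deep network and likely a more careful regularizer, but the joint-optimization structure is the point.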