

Intrinsic Self-Supervision for Data Quality Audits

Neural Information Processing Systems


Representation-Based Data Quality Audits for Audio

Gonzalez-Jimenez, Alvaro; Gröger, Fabian; Wermelinger, Linda; Bürli, Andrin; Kastanis, Iason; Lionetti, Simone; Pouly, Marc

arXiv.org Artificial Intelligence

ABSTRACT

Data quality issues such as off-topic samples, near duplicates, and label errors often limit the performance of audio-based systems. This approach leverages self-supervised audio representations to identify common data quality issues, creating ranked review lists that surface distinct issues within a single, unified process. The method is benchmarked on the ESC-50, GTZAN, and a proprietary industrial dataset, using both synthetic and naturally occurring corruptions. The results demonstrate that this framework achieves state-of-the-art ranking performance, often outperforming issue-specific baselines and enabling significant annotation savings by efficiently guiding human review.

Index Terms: Data quality, dataset auditing, representation learning, near-duplicate detection, label errors

1. INTRODUCTION

High-stakes audio applications, from predictive maintenance and safety monitoring to large-scale media search, depend on data that is abundant and trustworthy [1, 2, 3].