Representation-Based Data Quality Audits for Audio

Gonzalez-Jimenez, Alvaro, Gröger, Fabian, Wermelinger, Linda, Bürli, Andrin, Kastanis, Iason, Lionetti, Simone, Pouly, Marc

Oct-1-2025–arXiv.org Artificial Intelligence

ABSTRACT Data quality issues such as off-topic samples, near duplicates, and label errors often limit the performance of audio-based systems. This approach leverages self-supervised audio representations to identify these issues, producing ranked review lists that surface each issue type within a single, unified process. The method is benchmarked on ESC-50, GTZAN, and a proprietary industrial dataset, using both synthetic and naturally occurring corruptions. The results demonstrate that this framework achieves state-of-the-art ranking performance, often outperforming issue-specific baselines and enabling significant annotation savings by efficiently guiding human review.

Index Terms: Data quality, dataset auditing, representation learning, near-duplicate detection, label errors

1. INTRODUCTION High-stakes audio applications, from predictive maintenance and safety monitoring to large-scale media search, depend on data that is abundant and trustworthy [1, 2, 3].
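The idea of turning representations into ranked review lists can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes embeddings have already been computed by some self-supervised audio encoder, and the function names `rank_near_duplicates` and `rank_label_errors`, the toy data, and the scoring rules (pairwise cosine distance for near duplicates, a nearest-neighbor distance margin for label errors) are simplified, hypothetical stand-ins chosen only to show the general pattern.

```python
import numpy as np


def _cosine_distances(emb):
    """Pairwise cosine distance matrix for row-wise embeddings."""
    x = np.asarray(emb, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return 1.0 - x @ x.T


def rank_near_duplicates(emb):
    """Hypothetical helper: index pairs (i, j), i < j, closest pairs first."""
    dist = _cosine_distances(emb)
    i, j = np.triu_indices(dist.shape[0], k=1)
    order = np.argsort(dist[i, j], kind="stable")
    return [(int(a), int(b)) for a, b in zip(i[order], j[order])]


def rank_label_errors(emb, labels):
    """Hypothetical helper: sample indices, most suspicious first.

    Score = distance to nearest same-label sample minus distance to the
    nearest different-label sample. Assumes at least two classes and at
    least two samples per class.
    """
    dist = _cosine_distances(emb)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    d_same = np.where(same, dist, np.inf).min(axis=1)
    d_diff = np.where(~same, dist, np.inf).min(axis=1)
    score = d_same - d_diff
    return [int(k) for k in np.argsort(-score, kind="stable")]


# Toy data (made up): samples 0 and 1 are almost identical; sample 4 sits
# next to class 0 in embedding space but carries label 1.
embeddings = np.array(
    [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.05, 1.0], [0.98, 0.05]]
)
labels = [0, 0, 1, 1, 1]

pairs = rank_near_duplicates(embeddings)
suspects = rank_label_errors(embeddings, labels)
print("near-duplicate candidates:", pairs[:3])
print("label-error candidates:", suspects[:3])
```

A reviewer would inspect each list from the top and stop once the flagged items no longer look problematic, which is where the annotation savings mentioned in the abstract would come from.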

data mining, machine learning, selfclean

Country:
- Europe > Switzerland > Basel-City > Basel (0.05)

Genre:
- Research Report > New Finding (0.34)

Technology:
- Information Technology
  - Artificial Intelligence > Machine Learning (1.00)
  - Data Science
    - Data Mining > Anomaly Detection (0.47)
    - Data Quality (1.00)