Enhancing Machine Learning Performance through Intelligent Data Quality Assessment: An Unsupervised Data-centric Framework
Rahal, Manal, Ahmed, Bestoun S., Szabados, Gergely, Fornstedt, Torgny, Samuelsson, Jorgen
Poor data quality limits the benefits of Machine Learning (ML) and degrades otherwise high-performing ML software systems. As data grow in volume and complexity, they become more prone to quality problems. As a result, tedious and time-consuming work goes into data preparation and improvement before the rest of the ML pipeline can proceed. To address this challenge, we propose an intelligent data-centric evaluation framework that can identify high-quality data and improve the performance of an ML system. The proposed framework combines a curated set of quality measurements with unsupervised learning to distinguish high-quality from low-quality data. The framework is designed to integrate flexible, general-purpose methods so that it can be deployed in various domains and applications. To validate the framework, we implemented it in a real-world use case from analytical chemistry and tested it on three datasets of antisense oligonucleotides. A domain expert was consulted to identify the relevant quality measurements and to evaluate the outcomes of the framework. The results show that the quality-centric data evaluation framework identifies the characteristics of high-quality data; these characteristics guide efficient laboratory experiments and consequently improve the performance of the ML system.
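The general idea of pairing quality measurements with unsupervised learning can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm or the measurements, so everything here is an assumption made for illustration: a two-cluster k-means stands in for the unsupervised step, and the two measurement columns (`signal_to_noise`, `peak_symmetry`, both scaled so that higher is better) and their values are hypothetical.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Scale each measurement column to zero mean and unit variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(r, mus, sds)] for r in rows]


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def two_means(points, iters=50):
    """Plain k-means with k=2 (an assumed stand-in, not the paper's method)."""
    # Deterministic start: the points with the lowest and highest coordinate sums.
    ordered = sorted(points, key=sum)
    centers = [list(ordered[0]), list(ordered[-1])]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
                  for p in points]
        updated = []
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            updated.append([mean(c) for c in zip(*members)] if members else centers[k])
        if updated == centers:
            break
        centers = updated
    return labels, centers


def flag_high_quality(rows):
    """Return one boolean per sample: True if it falls in the better cluster."""
    labels, centers = two_means(standardize(rows))
    # Measurements are oriented so that higher is better, so the cluster
    # whose center has the larger coordinate sum is treated as high quality.
    high = 0 if sum(centers[0]) > sum(centers[1]) else 1
    return [lab == high for lab in labels]


# Hypothetical samples: (signal_to_noise, peak_symmetry), higher is better.
samples = [
    (0.90, 0.80), (0.85, 0.90), (0.95, 0.85),
    (0.20, 0.30), (0.30, 0.10), (0.10, 0.20),
]
flags = flag_high_quality(samples)
```

In this sketch the flags only separate the samples into two groups; deciding which measurements matter and whether the grouping is meaningful would still fall to a domain expert, as the abstract describes.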
Feb-18-2025
- Country:
- Asia > China > Zhejiang Province > Hangzhou (0.04)
- Europe > Czechia > Prague (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Europe > Sweden > Värmland County > Karlstad (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Genre:
- Research Report > New Finding (1.00)
- Industry:
- Information Technology (0.67)
- Technology: