Predicting the activity of chemical compounds based on machine learning approaches

Tu, Do Hoang, Van Lang, Tran, Xuyen, Pham Cong, Long, Le Mau

arXiv.org Artificial Intelligence 

ABSTRACT -- Exploring machine learning (ML) methods and techniques to address specific challenges in various fields is essential. In this work, we tackle a problem in cheminformatics: finding a suitable solution for predicting the activity of a chemical compound as accurately as possible. To this end, the study conducts experiments on 100 different combinations of existing techniques. The candidate solutions are then selected according to a set of criteria that includes the G-means, F1-score, and AUC metrics. The results have been tested on a dataset of about 10,000 chemical compounds from PubChem that have been classified according to their activity.

I. INTRODUCTION

In datasets from biological experiments that measure the activity of compounds against different biological targets, as commonly used in screening, there is usually a significant imbalance between active and inactive compounds, with inactive data points far outnumbering active ones. Training therefore requires machine learning models suited to such imbalance. Preprocessing the data before training is also crucial.

The following issues are addressed in order to predict the activity of chemical compounds using chemistry-related datasets:

Investigating the dependency of attributes or features in the dataset to potentially reduce the number of features. This can be done using methods such as the ANOVA F-test, which assesses the dependency of each feature on the target variable, or by using correlation coefficients.
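The feature-dependency check described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`anova_f`, `pearson`, `rank_features`), the descriptor names, and the toy values are all assumptions made for illustration. It computes a one-way ANOVA F statistic for each feature against the activity label, plus a Pearson correlation coefficient, using only the Python standard library.

```python
from math import sqrt
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F statistic of a single feature against class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two classes and more samples than classes")
    grand = mean(values)
    # Variation of class means around the grand mean (between groups).
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    # Variation of samples around their own class mean (within groups).
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_features(features, labels):
    """Feature names sorted from most to least class-dependent (by F statistic)."""
    return sorted(features, key=lambda name: anova_f(features[name], labels),
                  reverse=True)


# Hypothetical toy data: 1 = active, 0 = inactive (imbalanced on purpose).
labels = [1, 1, 0, 0, 0, 0]
features = {
    "descriptor_a": [3.0, 3.2, 1.0, 1.1, 0.9, 1.2],  # separates the classes
    "descriptor_b": [0.5, 0.1, 0.4, 0.2, 0.3, 0.6],  # mostly noise
}

ranking = rank_features(features, labels)
```

A feature with a large F statistic has class means that differ strongly relative to the spread within each class, so low-scoring features are candidates for removal. In practice a library routine such as scikit-learn's `f_classif` would likely be used instead of hand-written code; the sketch only makes the computation explicit.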