 Ensemble Learning


Pixel-wise classification in graphene-detection with tree-based machine learning algorithms

arXiv.org Artificial Intelligence

Mechanical exfoliation of graphene and its identification by optical inspection is one of the milestones in condensed matter physics that sparked the field of 2D materials. Finding regions of interest across the entire sample space and identifying the layer number are routine tasks potentially amenable to automation. We propose supervised pixel-wise classification methods that perform well even with a small number of training images and require short computation time without a GPU. We introduce four different tree-based machine learning algorithms: decision tree, random forest, extreme gradient boosting, and light gradient boosting machine. We train them with five optical microscopy images of graphene and evaluate their performance with multiple metrics and indices. We also discuss combinatorial machine learning models built from the three single classifiers and assess their performance in identification and reliability. The code developed in this paper is open to the public and will be released at github.com/gjung-group/Graphene_segmentation.


Hybrid Approach to Identify Druglikeness Leading Compounds against COVID-19 3CL Protease

arXiv.org Artificial Intelligence

SARS-CoV-2 is a positive-sense single-stranded RNA virus that had caused the death of more than 6.3 million people as of June 2022. Moreover, by disrupting global supply chains through lockdowns, the virus has indirectly caused devastating damage to the global economy. It is vital to design and develop drugs for this virus and its variants. In this paper, we develop an in-silico hybrid framework to repurpose existing therapeutic agents by finding drug-like bioactive molecules that would cure COVID-19. We applied the Lipinski rules to the molecules retrieved from the ChEMBL database and found 133 drug-like bioactive molecules against SARS coronavirus 3CL Protease. Based on standard IC50, the dataset was divided into three classes: active, inactive, and intermediate. Our comparative analysis demonstrated that the proposed Extra Tree Regressor (ETR) based QSAR model predicts the bioactivity of chemical compounds better than Gradient Boosting, XGBoost, Support Vector, Decision Tree, and Random Forest based regressor models. ADMET analysis was carried out to identify thirteen bioactive molecules with ChEMBL IDs 187460, 190743, 222234, 222628, 222735, 222769, 222840, 222893, 225515, 358279, 363535, 365134 and 426898. These molecules are highly suitable drug candidates for SARS-CoV-2 3CL Protease. In the next step, the efficacy of the bioactive molecules was computed in terms of binding affinity using molecular docking, and six bioactive molecules were shortlisted, with ChEMBL IDs 187460, 222769, 225515, 358279, 363535, and 365134. These molecules can be suitable drug candidates for SARS-CoV-2. It is anticipated that pharmacologists and drug manufacturers will further investigate these six molecules and adopt these promising compounds in their downstream drug development stages.
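As a rough illustration of the Lipinski filtering step mentioned in this abstract, the sketch below counts rule-of-five violations from precomputed descriptors. The `Molecule` record, the example descriptor values, and the at-most-one-violation cutoff are assumptions made for illustration and are not taken from the paper; in practice the descriptors would come from a cheminformatics toolkit.

```python
from dataclasses import dataclass


@dataclass
class Molecule:
    # Hypothetical record; descriptor values are assumed to be precomputed.
    name: str
    mol_weight: float   # molecular weight in daltons
    log_p: float        # octanol-water partition coefficient
    h_donors: int       # number of hydrogen-bond donors
    h_acceptors: int    # number of hydrogen-bond acceptors


def lipinski_violations(m: Molecule) -> int:
    """Count how many of the four rule-of-five criteria are violated."""
    return sum([
        m.mol_weight > 500,
        m.log_p > 5,
        m.h_donors > 5,
        m.h_acceptors > 10,
    ])


def is_drug_like(m: Molecule, max_violations: int = 1) -> bool:
    # Assumption: tolerate at most one violation, a common convention.
    return lipinski_violations(m) <= max_violations


# Made-up descriptor values, for illustration only.
candidates = [
    Molecule("mol_a", 350.4, 2.1, 2, 5),
    Molecule("mol_b", 720.9, 6.3, 7, 12),
    Molecule("mol_c", 510.0, 4.8, 3, 9),
]
drug_like = [m.name for m in candidates if is_drug_like(m)]
print(drug_like)
```

Setting `max_violations=0` would give the strict form of the filter, under which `mol_c` would also be rejected.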


An Empirical Analysis of the Efficacy of Different Sampling Techniques for Imbalanced Classification

arXiv.org Artificial Intelligence

Learning from imbalanced data is a challenging task. Standard classification algorithms tend to perform poorly when trained on imbalanced data. Special strategies need to be adopted, either by modifying the data distribution or by redesigning the underlying classification algorithm, to achieve desirable performance. The prevalence of imbalance in real-world datasets has led to the creation of a multitude of strategies for the class imbalance issue. However, not all of these strategies are useful or perform well across different imbalance scenarios. There are numerous approaches to dealing with imbalanced data, but an experimental comparison of their efficacy has not been conducted. In this study, we present a comprehensive analysis of 26 popular sampling techniques to understand their effectiveness in dealing with imbalanced data. Rigorous experiments have been conducted on 50 datasets with different degrees of imbalance to thoroughly investigate the performance of these techniques. A detailed discussion of the advantages and limitations of the techniques, as well as how to overcome such limitations, is presented. We identify some critical factors that affect the sampling strategies and provide recommendations on how to choose an appropriate sampling technique for a particular application.


Multiple Outputs -- xgboost 1.6.2 documentation

#artificialintelligence

Starting from version 1.6, XGBoost has experimental support for multi-output regression and multi-label classification in the Python package. Multi-label classification usually refers to targets that have multiple non-exclusive class labels. For instance, a movie can be simultaneously classified as both sci-fi and comedy. For a detailed explanation of the terminology related to different multi-output models, please refer to the scikit-learn user guide. Internally, XGBoost builds one model for each target, similar to sklearn meta estimators, with the added benefit of reusing data and other integrated features like SHAP.
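The "one model for each target" idea can be sketched in plain Python. This is not XGBoost's implementation: a toy depth-1 regression tree stands in for the boosted model, and the names `Stump` and `OneModelPerTarget` are hypothetical, introduced only for illustration.

```python
from statistics import mean


class Stump:
    """Toy depth-1 regression tree on a single numeric feature."""

    def fit(self, xs, ys):
        best = None
        # Every distinct value except the smallest is a candidate threshold,
        # so both sides of the split are always non-empty.
        for t in sorted(set(xs))[1:]:
            left = [y for x, y in zip(xs, ys) if x < t]
            right = [y for x, y in zip(xs, ys) if x >= t]
            lm, rm = mean(left), mean(right)
            sse = (sum((y - lm) ** 2 for y in left)
                   + sum((y - rm) ** 2 for y in right))
            if best is None or sse < best[0]:
                best = (sse, t, lm, rm)
        _, self.threshold, self.left_value, self.right_value = best
        return self

    def predict(self, xs):
        return [self.left_value if x < self.threshold else self.right_value
                for x in xs]


class OneModelPerTarget:
    """Fit an independent model for each output column."""

    def __init__(self, make_model):
        self.make_model = make_model

    def fit(self, xs, Y):
        n_targets = len(Y[0])
        self.models = [
            self.make_model().fit(xs, [row[j] for row in Y])
            for j in range(n_targets)
        ]
        return self

    def predict(self, xs):
        per_target = [m.predict(xs) for m in self.models]
        # Transpose so each row holds all targets for one sample.
        return [list(row) for row in zip(*per_target)]


xs = [1.0, 2.0, 3.0, 4.0]
Y = [[0.0, 10.0], [0.0, 10.0], [1.0, 10.0], [1.0, 20.0]]
model = OneModelPerTarget(Stump).fit(xs, Y)
predictions = model.predict([1.5, 3.5, 4.5])
print(predictions)
```

Because each target gets its own model, the two outputs here end up with different split thresholds even though they share the same input feature.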


MetaRF: Differentiable Random Forest for Reaction Yield Prediction with a Few Trails

arXiv.org Artificial Intelligence

Artificial intelligence has deeply revolutionized the field of medicinal chemistry with many impressive applications, but the success of these applications requires a massive amount of training samples with high-quality annotations, which seriously limits the wide usage of data-driven methods. In this paper, we focus on the reaction yield prediction problem, which assists chemists in selecting high-yield reactions in a new chemical space with only a few experimental trials. To attack this challenge, we first put forth MetaRF, an attention-based differentiable random forest model specially designed for few-shot yield prediction, where the attention weight of a random forest is automatically optimized by the meta-learning framework and can be quickly adapted to predict the performance of new reagents when given a few additional samples. To improve the few-shot learning performance, we further introduce a dimension-reduction based sampling method to determine valuable samples to be experimentally tested and then learned. Our methodology is evaluated on three different datasets and achieves satisfactory performance on few-shot prediction. In high-throughput experimentation (HTE) datasets, the average yield of our methodology's top 10 high-yield reactions is relatively close to the results of ideal yield selection.


Mechanical Properties Prediction in Metal Additive Manufacturing Using Machine Learning

arXiv.org Artificial Intelligence

Predicting mechanical properties in metal additive manufacturing (MAM) is vital to ensure the printed parts' performance and reliability and to determine whether they can fulfill the requirements of a specific application. Conducting experiments to estimate mechanical properties in MAM processes, however, is a laborious and expensive task. Moreover, such experiments can only be designed for a particular material in a certain MAM process. Machine learning (ML) methods, by contrast, are more flexible and cost-effective and can be used to predict mechanical properties based on the processing parameters and material properties. To this end, in this work, a comprehensive framework for benchmarking ML for mechanical properties is introduced. An extensive experimental dataset is collected from more than 90 MAM articles and 140 MAM companies' data sheets containing MAM processing conditions, machines, materials, and resultant mechanical properties, including yield strength, ultimate tensile strength, elastic modulus, elongation, hardness, and surface roughness. Physics-aware MAM featurization, adjustable ML models, and evaluation metrics are proposed to construct a comprehensive learning framework for mechanical properties prediction. Additionally, an Explainable AI method, namely SHAP analysis, was studied to explain and interpret the ML models' predicted values for mechanical properties. Moreover, data-driven explicit models have been identified to estimate mechanical properties based on the processing parameters and material properties with more interpretability than the employed ML models.


A Novel Hybrid Sampling Framework for Imbalanced Learning

arXiv.org Artificial Intelligence

Class imbalance is a frequently occurring scenario in classification tasks. Learning from imbalanced data poses a major challenge, which has instigated a lot of research in this area. Data preprocessing using sampling techniques is a standard approach to deal with the imbalance present in the data. Since standard classification algorithms do not perform well on imbalanced data, the dataset needs to be adequately balanced before training. This can be accomplished by oversampling the minority class or undersampling the majority class. In this study, a novel hybrid sampling algorithm is proposed. To overcome the limitations of the sampling techniques while ensuring the quality of the retained sampled dataset, a framework has been developed to combine three different sampling techniques. The Neighborhood Cleaning rule is first applied to reduce the imbalance. Random undersampling is then strategically coupled with the SMOTE algorithm to obtain an optimal balance in the dataset. This proposed hybrid methodology, termed "SMOTE-RUS-NC", has been compared with other state-of-the-art sampling techniques. The strategy is further incorporated into the ensemble learning framework to obtain a more robust classification algorithm, termed "SRN-BRF". Rigorous experimentation has been conducted on 26 imbalanced datasets with varying degrees of imbalance. In virtually all datasets, the two proposed algorithms outperformed existing sampling strategies, in many cases by a substantial margin. Especially in highly imbalanced datasets where popular sampling techniques failed utterly, they achieved unparalleled performance. The superior results obtained demonstrate the efficacy of the proposed models and their potential to be powerful sampling algorithms in the imbalanced domain.
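To make the combination of undersampling and oversampling concrete, here is a minimal sketch that pairs random undersampling of the majority class with a simplified SMOTE-style interpolation of the minority class. It is not the authors' SMOTE-RUS-NC algorithm: the Neighborhood Cleaning step is omitted, and the function names, the synthetic data, and the target class sizes are all assumptions chosen for illustration.

```python
import random


def random_undersample(majority, n_keep, rng):
    """Keep a random subset of the majority class."""
    return rng.sample(majority, n_keep)


def smote_like_oversample(minority, n_new, k, rng):
    """Create synthetic minority points by interpolating between a
    randomly chosen point and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # Sort remaining minority points by squared distance to base.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


rng = random.Random(0)
# Assumed toy data: 100 majority points and 10 minority points in 2-D.
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
minority = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(10)]

kept_majority = random_undersample(majority, 40, rng)
new_minority = smote_like_oversample(minority, 30, k=3, rng=rng)
balanced_minority = minority + new_minority
print(len(kept_majority), len(balanced_minority))
```

Meeting in the middle (here, 40 samples per class) discards fewer majority samples than pure undersampling and creates fewer synthetic points than pure oversampling.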


[100%OFF] Decision Trees, Random Forests, Bagging & XGBoost: R Studio

#artificialintelligence

You're looking for a complete Decision tree course that teaches you everything you need to create a Decision tree/Random Forest/XGBoost model in R, right? You've found the right Decision Trees and tree-based advanced techniques course! How will this course help you? A Verifiable Certificate of Completion is presented to all students who undertake this advanced Machine learning course. If you are a business manager, an executive, or a student who wants to learn and apply machine learning to real-world business problems, this course will give you a solid base by teaching you some of the advanced techniques of machine learning: Decision tree, Random Forest, Bagging, AdaBoost, and XGBoost.


Machine learning: A non-invasive prediction method for gastric cancer based on a survey of lifestyle behaviors

#artificialintelligence

Gastric cancer remains an enormous threat to human health. Clear diagnosis and timely treatment of gastrointestinal tumors are extremely important. The traditional diagnostic methods for gastric cancer (endoscopy, surgery, and pathological tissue extraction) are usually invasive, expensive, and time-consuming. Machine learning methods are fast and low-cost, and applying them to the diagnosis of gastric cancer overcomes the limitations of the traditional methods. This work aims to construct a cheap, non-invasive, rapid, and high-precision gastric cancer diagnostic model using personal behavioral lifestyles and non-invasive characteristics. A retrospective study was conducted on 3,630 participants. The developed models (extreme gradient boosting, decision tree, random forest, and logistic regression) were evaluated by cross-validation and by their generalization ability on our test set. We found that the model developed using fingerprints based on the extreme gradient boosting (XGBoost) algorithm produced better results than the other models. On the test set, its overall accuracy was 85.7%, AUC 89.6%, sensitivity 78.7%, specificity 76.9%, and positive predictive value 73.8%, verifying that the proposed model has significant medical value and good application prospects.
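For readers unfamiliar with the metrics reported above, the following sketch shows how accuracy, sensitivity, specificity, and positive predictive value follow from the counts of a binary confusion matrix. The labels and predictions are made-up toy values, not data from the study, and the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Compute common diagnostic metrics from binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # recall on the positive class
        "specificity": tn / (tn + fp),   # recall on the negative class
        "ppv": tp / (tp + fp),           # positive predictive value
    }


# Toy example: 4 positive cases and 6 negative cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

AUC is not included because it depends on predicted scores rather than hard labels.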


New drugs and stock market: how to predict pharma market reaction to clinical trial announcements

arXiv.org Artificial Intelligence

Pharmaceutical companies operate in a strictly regulated and highly risky environment in which a single slip can lead to serious financial implications. Accordingly, announcements of clinical trial results tend to determine the future course of events and are therefore closely monitored by the public. In this work, we provide statistical evidence for the influence of result announcements on the public pharma market value. Whereas most works focus on retrospective impact analysis, the present research aims to predict the numerical values of announcement-induced changes in stock prices. For this purpose, we develop a pipeline that includes a BERT-based model for extracting the sentiment polarity of announcements, a Temporal Fusion Transformer for forecasting the expected return, a graph convolution network for capturing event relationships, and gradient boosting for predicting the price change. The challenge of the problem lies in inherently different patterns of responses to positive and negative announcements, reflected in a stronger and more pronounced reaction to negative news. Moreover, phenomena such as a drop in stock prices after positive announcements affirm the counterintuitive nature of price behavior. Importantly, we discover two crucial factors that should be considered when working within a predictive framework. The first factor is the drug portfolio size of the company, with smaller drug diversification indicating greater susceptibility to an announcement. The second is the network effect of events related to the same company or nosology. All findings and insights are based on one of the biggest FDA (Food and Drug Administration) announcement datasets, consisting of 5436 clinical trial announcements from 681 companies over the last five years.