Decision Tree Learning
NeuroCADR: Drug Repurposing to Reveal Novel Anti-Epileptic Drug Candidates Through an Integrated Computational Approach
Drug repurposing is an emerging approach for drug discovery involving the reassignment of existing drugs for novel purposes. An alternative to the traditional de novo process of drug development, drug repurposing is faster, cheaper, and less failure-prone than traditional development. Recently, drug repurposing has been performed in silico, in which databases of drugs and chemical information are used to determine interactions between target proteins and drug molecules to identify potential drug candidates. This work proposes NeuroCADR, a novel system for drug repurposing via a multi-pronged approach consisting of k-nearest neighbor algorithms (KNN), random forest classification, and decision trees. Data were sourced from several databases consisting of interactions between diseases, symptoms, genes, and affiliated drug molecules, which were then compiled into datasets expressed in binary. The proposed method displayed a high level of accuracy, outperforming nearly all in silico approaches. NeuroCADR was applied to epilepsy, a condition characterized by seizures, periods of time with bursts of uncontrolled electrical activity in brain cells. Existing drugs for epilepsy can be ineffective and expensive, revealing a need for new antiepileptic drugs. NeuroCADR identified novel drug candidates for epilepsy that can be further approved through clinical trials. The algorithm has the potential to determine possible drug combinations to prescribe a patient based on a patient's prior medical history. This project examines NeuroCADR, a novel approach to computational drug repurposing capable of revealing potential drug candidates in neurological diseases such as epilepsy.
Performance is not enough: the story told by a Rashomon quartet
Biecek, Przemyslaw, Baniecki, Hubert, Krzyzinski, Mateusz, Cook, Dianne
Predictive modelling is often reduced to finding the best model that optimizes a selected performance measure. But what if the second-best model describes the data in a completely different way? What about the third-best? Is it possible that equally effective models describe different relationships in the data? Inspired by Anscombe's quartet, this paper introduces a Rashomon quartet: four models built on a synthetic dataset that have practically identical predictive performance. However, their visualization reveals distinct explanations of the relation between input variables and the target variable. The illustrative example aims to encourage the use of visualization to compare predictive models beyond their performance.
Improving the Validity of Decision Trees as Explanations
Nemecek, Jiri, Pevny, Tomas, Marecek, Jakub
Tree-based models can be competitive with deep neural networks on tabular data and, under some conditions, explainable. The explainability depends on the depth of the tree and the accuracy in each leaf of the tree. Decision trees containing leaves with unbalanced accuracy can provide misleading explanations. Low-accuracy leaves give less valid explanations, which could be interpreted as unfairness among explanations. Here, we train a shallow tree with the objective of minimizing the maximum misclassification error across all leaf nodes. Then, we extend each leaf with a separate tree-based model. The shallow tree provides a global explanation, while the overall statistical performance of the shallow tree with extended leaves improves upon decision trees of unlimited depth trained using classical methods (e.g., CART) and is comparable to state-of-the-art methods (e.g., well-tuned XGBoost).
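The two-stage idea in this abstract (a shallow tree trained to minimize the worst leaf error, then a separate model per leaf) can be illustrated with a toy sketch. This is not the authors' method: it assumes a depth-1 tree found by exhaustive search, and the `LeafExtension` class is a hypothetical stand-in that adds a single extra split where the paper uses a full tree-based model. All names are illustrative.

```python
from collections import Counter

def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def leaf_error(labels):
    """Misclassification rate of a leaf that predicts its majority label."""
    return 1.0 - Counter(labels).most_common(1)[0][1] / len(labels)

def fit_split(X, y, criterion):
    """Exhaustive search for the (score, feature, threshold) minimizing criterion."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if left and right:
                score = criterion(left, right)
                if best is None or score < best[0]:
                    best = (score, j, t)
    return best  # None when no threshold separates the rows

def worst_leaf_error(left, right):
    """Minimax objective: the error of the least accurate leaf."""
    return max(leaf_error(left), leaf_error(right))

def total_errors(left, right):
    """Classical objective: total number of misclassified samples."""
    return leaf_error(left) * len(left) + leaf_error(right) * len(right)

class LeafExtension:
    """Hypothetical leaf model: one extra split, or a constant if none helps."""
    def __init__(self, X, y):
        self.default = majority(y)
        self.split = fit_split(X, y, total_errors) if len(set(y)) > 1 else None
        if self.split:
            _, j, t = self.split
            self.left = majority([yi for r, yi in zip(X, y) if r[j] <= t])
            self.right = majority([yi for r, yi in zip(X, y) if r[j] > t])

    def predict(self, row):
        if not self.split:
            return self.default
        _, j, t = self.split
        return self.left if row[j] <= t else self.right

class ExtendedShallowTree:
    """Depth-1 global tree whose two leaves are each refined by a LeafExtension."""
    def __init__(self, X, y):
        # Assumes at least one valid split exists at the root.
        _, self.feature, self.threshold = fit_split(X, y, worst_leaf_error)
        go_left = [row[self.feature] <= self.threshold for row in X]
        self.leaves = {
            side: LeafExtension([r for r, g in zip(X, go_left) if g == side],
                                [yi for yi, g in zip(y, go_left) if g == side])
            for side in (True, False)
        }

    def explain(self, row):
        """Global explanation: which side of the shallow split the row falls on."""
        op = "<=" if row[self.feature] <= self.threshold else ">"
        return f"x[{self.feature}] {op} {self.threshold}"

    def predict(self, row):
        return self.leaves[row[self.feature] <= self.threshold].predict(row)

# XOR-style toy data: no single split is accurate, but the extended leaves are.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]
model = ExtendedShallowTree(X, y)
predictions = [model.predict(row) for row in X]
```

On this toy data the shallow split alone leaves both leaves at 50% error, while the per-leaf extensions recover every label; `explain` still reports only the shallow rule, which plays the role of the global explanation.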
Interpretable Outlier Summarization
Wang, Yu, Cao, Lei, Yan, Yizhou, Madden, Samuel
Outlier detection is critical in real applications to prevent financial fraud, defend against network intrusions, or detect imminent device failures. To reduce the human effort in evaluating outlier detection results and effectively turn the outliers into actionable insights, users often expect a system to automatically produce interpretable summarizations of subgroups of outlier detection results. Unfortunately, to date no such systems exist. To fill this gap, we propose STAIR, which learns a compact set of human-understandable rules to summarize and explain the anomaly detection results. Rather than use the classical decision tree algorithms to produce these rules, STAIR proposes a new optimization objective to produce a small number of rules with least complexity, hence strong interpretability, to accurately summarize the detection results. The learning algorithm of STAIR produces a rule set by iteratively splitting the large rules and is optimal in maximizing this objective in each iteration. Moreover, to effectively handle high-dimensional, highly complex data sets which are hard to summarize with simple rules, we propose a localized STAIR approach, called L-STAIR. Taking data locality into consideration, it simultaneously partitions data and learns a set of localized rules for each partition. Our experimental study on many outlier benchmark datasets shows that STAIR significantly reduces the complexity of the rules required to summarize the outlier detection results, making them more amenable for humans to understand and evaluate, compared to the decision tree methods.
Prediction Error Estimation in Random Forests
In this paper, error estimates of classification Random Forests are quantitatively assessed. Based on the initial theoretical framework built by Bates et al. (2023), the true error rate and expected error rate are theoretically and empirically investigated in the context of a variety of error estimation methods common to Random Forests. We show that in the classification case, Random Forests' estimates of prediction error are closer on average to the true error rate than to the average prediction error. This is the opposite of the findings of Bates et al. (2023), which were given for logistic regression. We further show that this result holds across different error estimation strategies such as cross-validation, bagging, and data splitting.
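Two of the estimation strategies named above, the out-of-bag estimate that comes with bagging and k-fold cross-validation, can be sketched side by side. This is only an illustration under stated assumptions: the base learner is a 1-nearest-neighbour rule on one-dimensional inputs rather than a decision tree, and every function name and the toy data are made up for the example.

```python
import random
from collections import Counter

def nearest_label(train, x):
    """1-nearest-neighbour base learner; train is a list of (x, label) pairs."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def vote(bags, x):
    """Majority vote over one base learner per bootstrap sample."""
    return Counter(nearest_label(bag, x) for bag in bags).most_common(1)[0][0]

def bootstrap(rows, rng):
    """Sample len(rows) rows with replacement."""
    return [rows[rng.randrange(len(rows))] for _ in rows]

def oob_error(data, n_estimators=25, seed=0):
    """Out-of-bag error: each point is judged only by learners that never saw it."""
    rng = random.Random(seed)
    n = len(data)
    draws = [[rng.randrange(n) for _ in range(n)] for _ in range(n_estimators)]
    wrong = counted = 0
    for i, (x, y) in enumerate(data):
        bags = [[data[k] for k in d] for d in draws if i not in d]
        if bags:  # skip the rare point that appears in every bootstrap sample
            counted += 1
            wrong += vote(bags, x) != y
    return wrong / counted if counted else float("nan")

def kfold_error(data, k=5, n_estimators=25, seed=0):
    """k-fold cross-validation error of the same bagged ensemble."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    wrong = 0
    for f in range(k):
        held_out = set(idx[f::k])
        train = [data[i] for i in idx if i not in held_out]
        bags = [bootstrap(train, rng) for _ in range(n_estimators)]
        wrong += sum(vote(bags, data[i][0]) != data[i][1] for i in held_out)
    return wrong / len(data)

# Toy data (an assumption for illustration): two overlapping Gaussian classes.
toy_rng = random.Random(1)
data = ([(toy_rng.gauss(0.0, 1.0), 0) for _ in range(30)]
        + [(toy_rng.gauss(1.5, 1.0), 1) for _ in range(30)])
oob = oob_error(data)
cv = kfold_error(data)
```

Both functions return an estimate of the misclassification rate for the same ensemble; comparing such estimates against the error on fresh data is the kind of question the paper studies, although the sketch itself says nothing about which quantity they track.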
Development and validation of an interpretable machine learning-based calculator for predicting 5-year weight trajectories after bariatric surgery: a multinational retrospective cohort SOPHIA study
Saux, Patrick, Bauvin, Pierre, Raverdy, Violeta, Teigny, Julien, Verkindt, Hélène, Soumphonphakdy, Tomy, Debert, Maxence, Jacobs, Anne, Jacobs, Daan, Monpellier, Valerie, Lee, Phong Ching, Lim, Chin Hong, Andersson-Assarsson, Johanna C, Carlsson, Lena, Svensson, Per-Arne, Galtier, Florence, Dezfoulian, Guelareh, Moldovanu, Mihaela, Andrieux, Severine, Couster, Julien, Lepage, Marie, Lembo, Erminia, Verrastro, Ornella, Robert, Maud, Salminen, Paulina, Mingrone, Geltrude, Peterli, Ralph, Cohen, Ricardo V, Zerrweck, Carlos, Nocca, David, Roux, Carel W Le, Caiazzo, Robert, Preux, Philippe, Pattou, François
Background: Weight loss trajectories after bariatric surgery vary widely between individuals, and predicting weight loss before the operation remains challenging. We aimed to develop a model using machine learning to provide individual preoperative prediction of 5-year weight loss trajectories after surgery. Methods: In this multinational retrospective observational study we enrolled adult participants (aged $\ge$18 years) from ten prospective cohorts (including ABOS [NCT01129297], BAREVAL [NCT02310178], the Swedish Obese Subjects study, and a large cohort from the Dutch Obesity Clinic [Nederlandse Obesitas Kliniek]) and two randomised trials (SleevePass [NCT00793143] and SM-BOSS [NCT00356213]) in Europe, the Americas, and Asia, with a 5-year follow-up after Roux-en-Y gastric bypass, sleeve gastrectomy, or gastric band. Patients with a previous history of bariatric surgery or large delays between scheduled and actual visits were excluded. The training cohort comprised patients from two centres in France (ABOS and BAREVAL). The primary outcome was BMI at 5 years. A model was developed using least absolute shrinkage and selection operator to select variables and the classification and regression trees algorithm to build interpretable regression trees. The performances of the model were assessed through the median absolute deviation (MAD) and root mean squared error (RMSE) of BMI. Findings: 10 231 patients from 12 centres in ten countries were included in the analysis, corresponding to 30 602 patient-years. Among participants in all 12 cohorts, 7701 (75.3%) were female, 2530 (24.7%) were male. Among 434 baseline attributes available in the training cohort, seven variables were selected: height, weight, intervention type, age, diabetes status, diabetes duration, and smoking status.
At 5 years, across external testing cohorts the overall mean MAD BMI was 2.8 kg/m$^2$ (95% CI 2.6-3.0) and mean RMSE BMI was 4.7 kg/m$^2$ (4.4-5.0), and the mean difference between predicted and observed BMI was -0.3 kg/m$^2$ (SD 4.7). This model is incorporated in an easy to use and interpretable web-based prediction tool to help inform clinical decision before surgery. Interpretation: We developed a machine learning-based model, which is internationally validated, for predicting individual 5-year weight loss trajectories after three common bariatric interventions.
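The three summary statistics reported above can be computed as in the short sketch below. It assumes that MAD here means the median of the absolute differences between predicted and observed BMI; the BMI values are invented for illustration and are not taken from the study.

```python
import math
from statistics import mean, median

def median_absolute_deviation(predicted, observed):
    """Median of |predicted - observed| (assumed reading of MAD in this context)."""
    return median([abs(p - o) for p, o in zip(predicted, observed)])

def root_mean_squared_error(predicted, observed):
    """Square root of the mean squared difference."""
    return math.sqrt(mean([(p - o) ** 2 for p, o in zip(predicted, observed)]))

def mean_difference(predicted, observed):
    """Signed bias: a negative value means predictions are lower than observations."""
    return mean([p - o for p, o in zip(predicted, observed)])

# Hypothetical 5-year BMI values in kg/m^2, for illustration only.
predicted_bmi = [30.0, 28.0, 35.0, 40.0]
observed_bmi = [31.0, 28.0, 33.0, 44.0]

mad = median_absolute_deviation(predicted_bmi, observed_bmi)   # 1.5
rmse = root_mean_squared_error(predicted_bmi, observed_bmi)    # sqrt(5.25)
bias = mean_difference(predicted_bmi, observed_bmi)            # -0.75
```

RMSE weights large misses more heavily than MAD, which is consistent with the abstract reporting a larger RMSE than MAD on the same cohorts.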
Small Area Estimation with Random Forests and the LASSO
Michal, Victoire, Wakefield, Jon, Schmidt, Alexandra M., Cavanaugh, Alicia, Robinson, Brian, Baumgartner, Jill
We consider random forests and LASSO methods for model-based small area estimation when the number of areas with sampled data is a small fraction of the total areas for which estimates are required. Abundant auxiliary information is available for the sampled areas, from the survey, and for all areas, from an exterior source, and the goal is to use auxiliary variables to predict the outcome of interest. We compare area-level random forests and LASSO approaches to a frequentist forward variable selection approach and a Bayesian shrinkage method. This work is motivated by Ghanaian data available from the sixth Living Standard Survey (GLSS) and the 2010 Population and Housing Census. We estimate the areal mean household log consumption using both datasets. The outcome variable is measured only in the GLSS for 3% of all the areas (136 out of 5019) and more than 170 potential covariates are available from both datasets. Among the four modelling methods considered, the Bayesian shrinkage performed the best in terms of bias, MSE and prediction interval coverages and scores, as assessed through a cross-validation study. We find substantial between-area variation, the log consumption areal point estimates showing a 1.3-fold variation across the GAMA region. The western areas are the poorest while the Accra Metropolitan Area district gathers the richest areas. In 2015, the United Nations (UN) released their 2030 agenda for sustainable development goals (SDGs) consisting of 17 goals, the first of which was to end poverty worldwide (Resolution, General Assembly and others, 2015). For their first SDG, the UN made seven guidelines explicit, including the implementation of "poverty eradication policies" at a disaggregated level. To that end, producing reliable and fine-grained pictures of socioeconomic status and income inequality is fundamental to help decision makers prioritise and target certain areas.
These detailed maps help local communities understand their situation compared to their neighbours, which also helps when planning interventions (Bedi et al., 2007). In Ghana, household surveys are collected every few years to measure the living conditions of households across Ghanaian regions and districts and to monitor poverty.
On the Robustness of Random Forest Against Untargeted Data Poisoning: An Ensemble-Based Approach
Anisetti, Marco, Ardagna, Claudio A., Balestrucci, Alessandro, Bena, Nicola, Damiani, Ernesto, Yeun, Chan Yeob
Machine learning is becoming ubiquitous. From finance to medicine, machine learning models are boosting decision-making processes and even outperforming humans in some tasks. This huge progress in terms of prediction quality does not, however, find a counterpart in the security of such models and corresponding predictions, where perturbations of fractions of the training set (poisoning) can seriously undermine the model accuracy. Research on poisoning attacks and defenses has received increasing attention in the last decade, leading to several promising solutions aiming to increase the robustness of machine learning. Among them, ensemble-based defenses, where different models are trained on portions of the training set and their predictions are then aggregated, provide strong theoretical guarantees at the price of a linear overhead. Surprisingly, ensemble-based defenses, which do not pose any restrictions on the base model, have not been applied to increase the robustness of random forest models. The work in this paper aims to fill this gap by designing and implementing a novel hash-based ensemble approach that protects random forest against untargeted, random poisoning attacks. An extensive experimental evaluation measures the performance of our approach against a variety of attacks, as well as its sustainability in terms of resource consumption and performance, and compares it with a traditional monolithic model based on random forest. A final discussion presents our main findings and compares our approach with existing poisoning defenses targeting random forests.
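The general shape of a hash-based ensemble defense, as described above, can be sketched as follows: each training sample is routed to one partition by a hash of its content, one model is trained per partition, and predictions are aggregated by majority vote. This is a sketch under assumptions, not the paper's implementation: the hash input (the `repr` of the whole sample), the number of partitions, and the 1-nearest-neighbour base learner standing in for a random forest are all choices made for the example.

```python
import hashlib
from collections import Counter

def partition_index(sample, n_partitions):
    """Deterministically map a sample to a partition via a hash of its content."""
    digest = hashlib.sha256(repr(sample).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_partitions

def build_ensemble(data, n_partitions=3):
    """Split the training set into disjoint, hash-assigned partitions.

    Each non-empty partition acts as the training set of one base model.
    """
    parts = [[] for _ in range(n_partitions)]
    for sample in data:
        parts[partition_index(sample, n_partitions)].append(sample)
    return [p for p in parts if p]

def predict(ensemble, x):
    """Majority vote over per-partition 1-NN learners (stand-in for random forests)."""
    votes = Counter(min(part, key=lambda s: abs(s[0] - x))[1] for part in ensemble)
    return votes.most_common(1)[0][0]

# Toy training set of (feature, label) pairs: class 0 near 0..39, class 1 near 100..139.
clean = [(i, 0) for i in range(40)] + [(100 + i, 1) for i in range(40)]

# Poison one sample by flipping its label; it lands in exactly one partition.
poisoned = [(0, 1)] + clean[1:]

clean_ensemble = build_ensemble(clean)
poisoned_ensemble = build_ensemble(poisoned)
```

Because partitions are disjoint, a single poisoned sample can influence at most one base model, so `predict(poisoned_ensemble, 0)` is still expected to return class 0 as long as the other partitions hold clean class-0 samples. A monolithic 1-NN model trained on `poisoned` would return the flipped label for the same query.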
How to choose the most appropriate centrality measure? A decision tree approach
Chebotarev, Pavel, Gubanov, Dmitry
Centrality metrics play a crucial role in network analysis, while the choice of specific measures significantly influences the accuracy of conclusions as each measure represents a unique concept of node importance. Among over 400 proposed indices, selecting the most suitable ones for specific applications remains a challenge. Existing approaches -- model-based, data-driven, and axiomatic -- have limitations, requiring association with models, training datasets, or restrictive axioms for each specific application. To address this, we introduce the culling method, which relies on the expert concept of centrality behavior on simple graphs. The culling method involves forming a set of candidate measures, generating a list of the smallest possible graphs needed to distinguish the measures from each other, constructing a decision-tree survey, and identifying the measure consistent with the expert's concept. We apply this approach to a diverse set of 40 centralities, including novel kernel-based indices, and combine it with the axiomatic approach. Remarkably, only 13 small 1-trees are sufficient to separate all 40 measures, even for pairs of closely related ones. By adopting simple ordinal axioms like Self-consistency or Bridge axiom, the set of measures can be drastically reduced, making the culling survey short. Applying the culling method provides insightful findings on some centrality indices, such as PageRank, Bridging, and dissimilarity-based Eigencentrality measures, among others. The proposed approach offers a cost-effective solution in terms of labor and time, complementing existing methods for measure selection, and providing deeper insights into the underlying mechanisms of centrality measures.
Hyperbolic Random Forests
Doorenbos, Lars, Márquez-Neila, Pablo, Sznitman, Raphael, Mettes, Pascal
Hyperbolic space is becoming a popular choice for representing data due to the hierarchical structure - whether implicit or explicit - of many real-world datasets. Along with it comes a need for algorithms capable of solving fundamental tasks, such as classification, in hyperbolic space. Recently, multiple papers have investigated hyperbolic alternatives to hyperplane-based classifiers, such as logistic regression and SVMs. While effective, these approaches struggle with more complex hierarchical data. We, therefore, propose to generalize the well-known random forests to hyperbolic space. We do this by redefining the notion of a split using horospheres. Since finding the globally optimal split is computationally intractable, we find candidate horospheres through a large-margin classifier. To make hyperbolic random forests work on multi-class data and imbalanced experiments, we furthermore outline a new method for combining classes based on their lowest common ancestor and a class-balanced version of the large-margin loss. Experiments on standard and new benchmarks show that our approach outperforms both conventional random forest algorithms and recent hyperbolic classifiers.
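A horosphere-based split can be illustrated in the Poincaré ball model, where horospheres are level sets of the Busemann function of an ideal point on the unit sphere. The sketch below assumes that parameterization (ideal point `p` plus an offset) and places the split by hand; it does not implement the large-margin search for candidate horospheres described in the abstract, and the function names are illustrative.

```python
import math

def busemann(x, p):
    """Busemann function in the Poincare ball for an ideal point p with |p| = 1.

    B_p(x) = log(|p - x|^2 / (1 - |x|^2)); its level sets are horospheres
    tangent to the boundary of the ball at p.
    """
    sq_dist = sum((pi - xi) ** 2 for pi, xi in zip(p, x))
    sq_norm = sum(xi ** 2 for xi in x)
    return math.log(sq_dist / (1.0 - sq_norm))

def horosphere_split(points, p, offset):
    """Partition points by the horosphere {x : B_p(x) = offset}.

    Points with B_p(x) <= offset lie inside the horoball (closer to p).
    """
    inside = [x for x in points if busemann(x, p) <= offset]
    outside = [x for x in points if busemann(x, p) > offset]
    return inside, outside

# Ideal point on the boundary of the 2-D ball and three sample points inside it.
p = (1.0, 0.0)
points = [(0.5, 0.0), (0.0, 0.0), (-0.5, 0.0)]

# The horosphere through the origin has offset 0, since B_p(0) = log(1 / 1) = 0.
inside, outside = horosphere_split(points, p, offset=0.0)
```

With the offset set to 0, the point halfway toward `p` and the origin itself fall inside the horoball, while the point on the opposite side falls outside. A tree node in this setting would store `p` and the offset in place of the feature and threshold of a Euclidean axis-aligned split.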