What is Machine Learning? A Primer for the Epidemiologist


Machine learning is a branch of computer science that has the potential to transform epidemiologic sciences. Amid a growing focus on "Big Data," it offers epidemiologists new tools to tackle problems for which classical methods are not well-suited. In order to critically evaluate the value of integrating machine learning algorithms and existing methods, however, it is essential to address language and technical barriers between the two fields that can make it difficult for epidemiologists to read and assess machine learning studies. Here, we provide an overview of the concepts and terminology used in machine learning literature, which encompasses a diverse set of tools with goals ranging from prediction to classification to clustering. We provide a brief introduction to 5 common machine learning algorithms and 4 ensemble-based approaches. We then summarize epidemiologic applications of machine learning techniques in the published literature. We recommend approaches to incorporate machine learning in epidemiologic research and discuss opportunities and challenges for integrating machine learning and existing epidemiologic research methods.

Machine learning is a branch of computer science that broadly aims to enable computers to "learn" without being directly programmed (1). It has origins in the artificial intelligence movement of the 1950s and emphasizes practical objectives and applications, particularly prediction and optimization. Computers "learn" in machine learning by improving their performance at tasks through "experience" (2, p. xv). In practice, "experience" usually means fitting to data; hence, there is not a clear boundary between machine learning and statistical approaches. Indeed, whether a given methodology is considered "machine learning" or "statistical" often reflects its history as much as genuine differences, and many algorithms (e.g., least absolute shrinkage and selection operator (LASSO), stepwise regression) may or may not be considered machine learning depending on whom you ask. Still, despite methodological similarities, machine learning is philosophically and practically distinguishable. At the risk of (considerable) oversimplification, machine learning generally emphasizes predictive accuracy over hypothesis-driven inference, usually focusing on large, high-dimensional (i.e., having many covariates) data sets (3, 4).

Regardless of the precise distinction between approaches, in practice, machine learning offers epidemiologists important tools. In particular, a growing focus on "Big Data" emphasizes problems and data sets on which machine learning algorithms excel while more commonly used statistical approaches struggle. This primer provides a basic introduction to machine learning with the aim of giving readers a foundation for critically reading studies based on these methods and a jumping-off point for those interested in using machine learning techniques in epidemiologic research.

Importance measures derived from random forests: characterisation and extension

Nowadays new technologies, and especially artificial intelligence, are increasingly established in our society. Big data analysis and machine learning, two sub-fields of artificial intelligence, are at the core of many recent breakthroughs in many application fields (e.g., medicine, communication, finance), including some that are closely tied to our day-to-day life (e.g., social networks, computers, smartphones). In machine learning, significant improvements are usually achieved at the price of increasing computational complexity and thanks to bigger datasets. Currently, cutting-edge models built by the most advanced machine learning algorithms have typically become very efficient and profitable but also extremely complex. Their complexity is such that these models are commonly seen as black boxes providing a prediction or a decision that cannot be interpreted or justified. Nevertheless, whether these models are used autonomously or as a simple decision-support tool, they are already being used in applications where health and human life are at stake. Therefore, it is clearly necessary not to blindly believe everything coming out of those models without a detailed understanding of their predictions or decisions. Accordingly, this thesis aims at improving the interpretability of models built by a specific family of machine learning algorithms, the so-called tree-based methods. Several mechanisms have been proposed to interpret these models, and throughout this thesis we aim to improve their understanding, study their properties, and define their limitations.

Interpretable Machines: Constructing Valid Prediction Intervals with Random Forests

An important issue when using machine learning algorithms in recent research is the lack of interpretability. Although these algorithms provide accurate point predictions for various learning problems, uncertainty estimates connected with point predictions are rather scarce. A contribution toward closing this gap for the Random Forest Regression Learner is presented here. Based on its Out-of-Bag procedure, several parametric and non-parametric prediction intervals are provided for Random Forest point predictions, and theoretical guarantees for their correct coverage probability are delivered. In a second part, a thorough Monte Carlo simulation study evaluates the performance of the proposed methods from three perspectives: (i) analyzing the coverage rate of the proposed prediction intervals, (ii) inspecting interval width, and (iii) verifying the competitiveness of the proposed intervals with existing methods. The simulation shows that the proposed prediction intervals are robust to non-normal residual distributions and are competitive, providing correct coverage rates and comparably narrow interval lengths, even for comparably small samples.
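
As a rough illustration of the non-parametric idea in this abstract, the sketch below builds a prediction interval by shifting a point prediction by empirical quantiles of out-of-bag residuals. It is a minimal sketch under stated assumptions, not the authors' construction: the function names, the linear-interpolation quantile, and the residual values (taken here as observed value minus out-of-bag prediction) are all hypothetical.

```python
def empirical_quantile(values, q):
    """Linear-interpolation quantile of a list of numbers, with 0 <= q <= 1."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def residual_quantile_interval(point_prediction, oob_residuals, alpha=0.1):
    """Shift the point prediction by the alpha/2 and 1 - alpha/2 residual quantiles.

    Assumes residuals are defined as (observed value - out-of-bag prediction).
    """
    lower = point_prediction + empirical_quantile(oob_residuals, alpha / 2)
    upper = point_prediction + empirical_quantile(oob_residuals, 1 - alpha / 2)
    return lower, upper


# Hypothetical out-of-bag residuals; in practice these would come from a fitted forest.
oob_residuals = [-2.0, -1.0, -0.5, 0.0, 0.0, 0.5, 1.0, 2.0, 3.0]
lower, upper = residual_quantile_interval(10.0, oob_residuals, alpha=0.2)
print(lower, upper)
```

Because the interval uses residual quantiles directly, it does not need to be symmetric around the point prediction, which is one reason quantile-based intervals can cope with skewed residuals.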

The Last State of Artificial Intelligence in Project Management

Artificial intelligence (AI) has been used to advance different fields, such as education, healthcare, and finance. However, the application of AI in the field of project management (PM) has not progressed equally. This paper reports on a systematic review of published studies investigating the application of AI in PM. The review identified relevant papers using the Web of Science, ScienceDirect, and Google Scholar databases. Of the 652 articles found, 58 met the predefined criteria and were included in the review. Included papers were classified along the following dimensions: PM knowledge areas, PM processes, and AI techniques. The results indicated that the application of AI in PM was in its early stages and that AI models have not been applied to multiple PM processes, especially in the process groups of project stakeholder management, project procurement management, and project communication management. However, the most popular PM processes among included papers were project effort prediction and cost estimation, and the most popular AI techniques were support vector machines, neural networks, and genetic algorithms.

A unified survey on treatment effect heterogeneity modeling and uplift modeling

A central question in many fields of scientific research is to determine how an outcome would be affected by an action, or to measure the effect of an action (a.k.a. the treatment effect). In recent years, a need for estimating heterogeneous treatment effects, conditional on the different characteristics of individuals, has emerged from research fields such as personalized healthcare, social science, and online marketing. To meet this need, researchers and practitioners from different communities have developed algorithms by taking the treatment effect heterogeneity modeling approach and the uplift modeling approach, respectively. In this paper, we provide a unified survey of these two seemingly disconnected yet closely related approaches under the potential outcome framework. We then provide a structured survey of existing methods, emphasizing their inherent connections, with a set of unified notations to make comparisons of the different methods easy. We then review the main applications of the surveyed methods in personalized marketing, personalized medicine, and social studies. Finally, we summarize the existing software packages, present discussions based on the use of the methods on synthetic, semi-synthetic, and real-world data sets, and provide some general guidelines for choosing methods.
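
To make the notion of a heterogeneous treatment effect concrete, the sketch below estimates a per-subgroup effect as the difference between the mean outcome of treated and untreated records. It is a deliberately simple illustration, not one of the surveyed methods: the function name and the data are hypothetical, and the estimate is only meaningful under the assumption that treatment was assigned at random within each subgroup.

```python
from collections import defaultdict


def subgroup_uplift(records):
    """Difference in mean outcome (treated minus control) for each subgroup.

    records: iterable of (group, treated, outcome) tuples.
    Assumes randomized treatment assignment within each subgroup.
    """
    # For each group keep [sum of outcomes, count] for treated and control.
    sums = defaultdict(lambda: {True: [0.0, 0], False: [0.0, 0]})
    for group, treated, outcome in records:
        cell = sums[group][bool(treated)]
        cell[0] += outcome
        cell[1] += 1

    uplift = {}
    for group, cells in sums.items():
        (t_sum, t_n), (c_sum, c_n) = cells[True], cells[False]
        if t_n and c_n:  # skip groups missing either arm
            uplift[group] = t_sum / t_n - c_sum / c_n
    return uplift


# Hypothetical data: (subgroup, treated?, binary outcome).
records = [
    ("young", 1, 1), ("young", 1, 0), ("young", 0, 0), ("young", 0, 0),
    ("old", 1, 1), ("old", 1, 1), ("old", 0, 1), ("old", 0, 1),
]
uplift = subgroup_uplift(records)
print(uplift)
```

In this toy data the estimated effect differs between the two subgroups, which is the kind of heterogeneity both modeling approaches aim to capture with far richer covariates.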

Using Machine Learning to Forecast Future Earnings

In this essay, we comprehensively evaluate the feasibility and suitability of adopting machine learning models to forecast corporate fundamentals (i.e., earnings), and we thoroughly compare the predictions of our method with both analysts' consensus estimates and traditional statistical models. As a result, our model has proved capable of serving as a useful auxiliary tool for analysts making predictions about company fundamentals. Compared with traditional statistical models widely adopted in the industry, such as logistic regression, our method achieves satisfactory improvements in both prediction accuracy and speed. We are also confident that this model still has considerable room to evolve, and we hope that in the near future machine learning models could outperform professional analysts.

A Numerical Transform of Random Forest Regressors corrects Systematically-Biased Predictions

Over the past decade, random forest models have become widely used as a robust method for high-dimensional data regression tasks. In part, the popularity of these models arises from the fact that they require little hyperparameter tuning and are not very susceptible to overfitting. Random forest regression models are comprised of an ensemble of decision trees that independently predict the value of a (continuous) dependent variable; predictions from each of the trees are ultimately averaged to yield an overall predicted value from the forest. Using a suite of representative real-world datasets, we find a systematic bias in predictions from random forest models. We find that this bias is recapitulated in simple synthetic datasets, regardless of whether or not they include irreducible error (noise) in the data, but that models employing boosting do not exhibit this bias. Here we demonstrate the basis for this problem, and we use the training data to define a numerical transformation that fully corrects it. Application of this transformation yields improved predictions in every one of the real-world and synthetic datasets evaluated in our study.
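
The averaging step described in this abstract can be sketched in a few lines. The example below is only an illustration of how per-tree predictions are combined: the three hand-written "trees" are hypothetical single-split stumps rather than trees learned from data, and the sketch does not reproduce the bias analysis or the correction proposed by the authors.

```python
def make_stump(threshold, left_value, right_value):
    """Hypothetical one-split regression tree: a piecewise-constant function of x."""
    return lambda x: left_value if x <= threshold else right_value


def forest_predict(trees, x):
    """Forest prediction: the plain average of the individual tree predictions."""
    return sum(tree(x) for tree in trees) / len(trees)


# Three hand-written stumps standing in for trees fit on bootstrap samples.
trees = [
    make_stump(0.3, 1.0, 4.0),
    make_stump(0.5, 1.5, 5.0),
    make_stump(0.7, 2.0, 6.0),
]

tree_outputs = [tree(0.6) for tree in trees]
prediction = forest_predict(trees, 0.6)
print(tree_outputs, prediction)
```

One property visible even in this toy case is that an average can never fall outside the range of the individual tree outputs, so the forest cannot predict values more extreme than those stored in its leaves.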

Asymptotic Unbiasedness of the Permutation Importance Measure in Random Forest Models

Variable selection in sparse regression models is an important task, as applications ranging from biomedical research to econometrics have shown. Especially for higher-dimensional regression problems, in which the link function between response and covariates cannot be directly detected, the selection of informative variables is challenging. Under these circumstances, the Random Forest method is a helpful tool for predicting new outcomes while delivering measures for variable selection. One common approach is the use of the permutation importance. Given its intuitive idea and flexible usage, it is important to explore the circumstances under which the permutation importance based on Random Forest correctly indicates informative covariates. To this end, we deliver theoretical guarantees for the validity of the permutation importance measure under specific assumptions and prove its (asymptotic) unbiasedness. An extensive simulation study verifies our findings.
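
The intuition behind permutation importance can be sketched as follows: shuffle one covariate at a time and record how much the prediction error grows. This is a generic, model-agnostic sketch and not the exact Random Forest variant analyzed in the paper (which is typically computed on out-of-bag samples). The function names, the toy data, and the stand-in `model` are all hypothetical.

```python
import random


def mse(model, X, y):
    """Mean squared error of model predictions on rows X with targets y."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Average increase in MSE when each column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between column j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy data: the response depends on the first covariate only.
data_rng = random.Random(1)
X = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(200)]
y = [3.0 * row[0] for row in X]


def model(row):
    """Stand-in for a fitted predictor; it happens to match the true signal."""
    return 3.0 * row[0]


importances = permutation_importance(model, X, y)
print(importances)
```

In this toy setting the uninformative second covariate receives an importance of zero, while shuffling the informative first covariate raises the error, which is the behavior one would hope the measure shows in general.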

Variable Selection with Random Survival Forest and Bayesian Additive Regression Tree for Survival Data

In this paper we utilize a survival analysis methodology incorporating Bayesian additive regression trees to account for nonlinear and additive covariate effects. We compare the performance of Bayesian additive regression trees, Cox proportional hazards, and random survival forest models for censored survival data, using simulation studies and a survival analysis of breast cancer based on the U.S. SEER database for the year 2005. In the simulation studies, we compare the three models across varying sample sizes and censoring rates on the basis of bias and prediction accuracy. In the survival analysis of breast cancer, we retrospectively analyze a subset of 1500 patients with invasive ductal carcinoma, a common form of breast cancer mostly affecting older women. The predictive potential of the three models is then compared using some widely used performance assessment measures from the survival literature.

Scheduling optimization of parallel linear algebra algorithms using Supervised Learning

Linear algebra algorithms are used widely in a variety of domains, e.g., machine learning, numerical physics, and video game graphics. For all these applications, loop-level parallelism is required to achieve high performance. However, finding the optimal way to schedule the workload between threads is a non-trivial problem because it depends on the structure of the algorithm being parallelized and the hardware the executable is run on. In the realm of Asynchronous Many-Task runtime systems, a key aspect of the scheduling problem is predicting the proper chunk-size, where the chunk-size is defined as the number of iterations of a for-loop assigned to a thread as one task. In this paper, we study the application of supervised learning models to predict the chunk-size that yields maximum performance on multiple parallel linear algebra operations using the HPX backend of the Blaze linear algebra library. More precisely, we generate our training and test sets by measuring the performance of the application with different chunk-sizes for multiple linear algebra operations: vector addition, matrix-vector multiplication, matrix-matrix addition, and matrix-matrix multiplication. We compare the use of logistic regression, neural networks, and decision trees with a newly developed decision-tree-based model in order to predict the optimal value for chunk-size. Our results show that classical decision trees and our custom decision tree model are able to forecast a chunk-size that results in good performance for the linear algebra operations.
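
The role of the chunk-size can be illustrated with a small sketch that splits the iterations of a vector-addition loop into tasks of at most `chunk_size` iterations each and hands them to a thread pool. This only illustrates the task decomposition: it uses Python's standard `concurrent.futures` rather than HPX or Blaze, the helper names are hypothetical, and Python threads will not show the speedups the paper measures.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(num_iterations, chunk_size):
    """Return (start, stop) index ranges, each covering at most chunk_size iterations."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, num_iterations))
        for start in range(0, num_iterations, chunk_size)
    ]


def chunked_vector_add(a, b, chunk_size, max_workers=4):
    """Add two equal-length vectors, scheduling one task per chunk of iterations."""
    result = [0.0] * len(a)

    def add_range(bounds):
        start, stop = bounds
        for i in range(start, stop):  # each task writes a disjoint index range
            result[i] = a[i] + b[i]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list(...) forces all tasks to finish and surfaces any exception.
        list(pool.map(add_range, split_into_chunks(len(a), chunk_size)))
    return result


chunks = split_into_chunks(10, 4)
total = chunked_vector_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], chunk_size=2)
print(chunks, total)
```

A small chunk-size creates many tasks and more scheduling overhead, while a large one creates few tasks and may leave threads idle; that trade-off is what makes the best value depend on both the operation and the hardware.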