minimal depth
On Minimal Depth in Neural Networks
A characterization of the representability of neural networks is relevant to comprehend their success in artificial intelligence. This study investigate two topics on ReLU neural network expressivity and their connection with a conjecture related to the minimum depth required for representing any continuous piecewise linear (CPWL) function. The topics are the minimal depth representation of the sum and max operations, as well as the exploration of polytope neural networks. For the sum operation, we establish a sufficient condition on the minimal depth of the operands to find the minimal depth of the operation. In contrast, regarding the max operation, a comprehensive set of examples is presented, demonstrating that no sufficient conditions, depending solely on the depth of the operands, would imply a minimal depth for the operation. The study also examine the minimal depth relationship between convex CPWL functions. On polytope neural networks, we investigate basic depth properties from Minkowski sums, convex hulls, number of vertices, faces, affine transformations, and indecomposable polytopes. More significant findings include depth characterization of polygons; identification of polytopes with an increasing number of vertices, exhibiting small depth and others with arbitrary large depth; and most notably, the minimal depth of simplices, which is strictly related to the minimal depth conjecture in ReLU networks.
Random survival forests with multivariate longitudinal endogenous covariates
Devaux, Anthony, Helmer, Catherine, Genuer, Robin, Proust-Lima, Cรฉcile
Predicting the individual risk of a clinical event using the complete patient history is still a major challenge for personalized medicine. Among the methods developed to compute individual dynamic predictions, the joint models have the assets of using all the available information while accounting for dropout. However, they are restricted to a very small number of longitudinal predictors. Our objective was to propose an innovative alternative solution to predict an event probability using a possibly large number of longitudinal predictors. We developed DynForest, an extension of competing-risk random survival forests that handles endogenous longitudinal predictors. At each node of the tree, the time-dependent predictors are translated into time-fixed features (using mixed models) to be used as candidates for splitting the subjects into two subgroups. The individual event probability is estimated in each tree by the Aalen-Johansen estimator of the leaf in which the subject is classified according to his/her history of predictors. The final individual prediction is given by the average of the tree-specific individual event probabilities. We carried out a simulation study to demonstrate the performances of DynForest both in a small dimensional context (in comparison with joint models) and in a large dimensional context (in comparison with a regression calibration method that ignores informative dropout). We also applied DynForest to (i) predict the individual probability of dementia in the elderly according to repeated measures of cognitive, functional, vascular and neuro-degeneration markers, and (ii) quantify the importance of each type of markers for the prediction of dementia. Implemented in the R package DynForest, our methodology provides a novel and appropriate solution for the prediction of events from any number of longitudinal endogenous predictors.
Random Forests for time-fixed and time-dependent predictors: The DynForest R package
Devaux, Anthony, Proust-Lima, Cรฉcile, Genuer, Robin
The R package DynForest implements random forests for predicting a categorical or a (multiple causes) time-to-event outcome based on time-fixed and time-dependent predictors. Through the random forests, the time-dependent predictors can be measured with error at subject-specific times, and they can be endogeneous (i.e., impacted by the outcome process). They are modeled internally using flexible linear mixed models (thanks to lcmm package) with time-associations pre-specified by the user. DynForest computes dynamic predictions that take into account all the information from time-fixed and time-dependent predictors. DynForest also provides information about the most predictive variables using variable importance and minimal depth. Variable importance can also be computed on groups of variables. To display the results, several functions are available such as summary and plot functions. This paper aims to guide the user with a step-by-step example of the different functions for fitting random forests within DynForest.
Computational analysis of pathological image enables interpretable prediction for microsatellite instability
Zhu, Jin, Wu, Wangwei, Zhang, Yuting, Lin, Shiyun, Jiang, Yukang, Liu, Ruixian, Wang, Xueqin
Microsatellite instability (MSI) is associated with several tumor types and its status has become increasingly vital in guiding patient treatment decisions. However, in clinical practice, distinguishing MSI from its counterpart is challenging since the diagnosis of MSI requires additional genetic or immunohistochemical tests. In this study, interpretable pathological image analysis strategies are established to help medical experts to automatically identify MSI. The strategies only require ubiquitous Haematoxylin and eosin-stained whole-slide images and can achieve decent performance in the three cohorts collected from The Cancer Genome Atlas. The strategies provide interpretability in two aspects. On the one hand, the image-level interpretability is achieved by generating localization heat maps of important regions based on the deep learning network; on the other hand, the feature-level interpretability is attained through feature importance and pathological feature interaction analysis. More interestingly, both from the image-level and feature-level interpretability, color features and texture characteristics are shown to contribute the most to the MSI predictions. Therefore, the classification models under the proposed strategies can not only serve as an efficient tool for predicting the MSI status of patients, but also provide more insights to pathologists with clinical understanding.
A random forest based approach for predicting spreads in the primary catastrophe bond market
Makariou, Despoina, Barrieu, Pauline, Chen, Yining
We introduce a random forest approach to enable spreads' prediction in the primary catastrophe bond market. We investigate whether all information provided to investors in the offering circular prior to a new issuance is equally important in predicting its spread. The whole population of non-life catastrophe bonds issued from December 2009 to May 2018 is used. The random forest shows an impressive predictive power on unseen primary catastrophe bond data explaining 93% of the total variability. For comparison, linear regression, our benchmark model, has inferior predictive performance explaining only 47% of the total variability. All details provided in the offering circular are predictive of spread but in a varying degree. The stability of the results is studied. The usage of random forest can speed up investment decisions in the catastrophe bond industry.
Understanding random forests with randomForestExplainer
Next, we pass it to the function plot_min_depth_distribution and under default settings obtain obtain a plot of the distribution of minimal depth for top ten variables according to mean minimal depth calculated using top trees (mean_sample "top_trees"). We could also pass our forest directly to the plotting function but if we want to make more than one plot of the minimal depth distribution is more efficient to pass the min_depth_frame to the plotting function so that it will not be calculated again for each plot (this works similarly for other plotting functions of randomForestExplainer). The function plot_min_depth_distribution offers three possibilities when it comes to calculating the mean minimal depth, which differ in he way they treat missing values that appear when a variable is not used for splitting in a tree. Note that the depth of a tree is equal to the length of the longest path from root to leave in this tree. This equals the maximum depth of a variable in this tree plus one, as leaves are by definition not split by any variable.
ggRandomForests: Visually Exploring a Random Forest for Regression
Random Forests [Breiman:2001] (RF) are a fully non-parametric statistical method requiring no distributional assumptions on covariate relation to the response. RF are a robust, nonlinear technique that optimizes predictive accuracy by fitting an ensemble of trees to stabilize model estimates. The randomForestSRC package (http://cran.r-project.org/package=randomForestSRC) is a unified treatment of Breiman's random forests for survival, regression and classification problems. Predictive accuracy make RF an attractive alternative to parametric models, though complexity and interpretability of the forest hinder wider application of the method. We introduce the ggRandomForests package (http://cran.r-project.org/package=ggRandomForests), for visually understand random forest models grown in R with the randomForestSRC package. The vignette is a tutorial for using the ggRandomForests package with the randomForestSRC package for building and post-processing a regression random forest. In this tutorial, we explore a random forest model for the Boston Housing Data, available in the MASS package. We grow a random forest for regression and demonstrate how ggRandomForests can be used when determining variable associations, interactions and how the response depends on predictive variables within the model. The tutorial demonstrates the design and usage of many of ggRandomForests functions and features how to modify and customize the resulting ggplot2 graphic objects along the way. A development version of the ggRandomForests package is available on Github. We invite comments, feature requests and bug reports for this package at (https://github.com/ehrlinger/ggRandomForests).