Burk, Lukas
Conditional Feature Importance with Generative Modeling Using Adversarial Random Forests
Blesch, Kristin, Koenen, Niklas, Kapar, Jan, Golchian, Pegah, Burk, Lukas, Loecher, Markus, Wright, Marvin N.
Explainable artificial intelligence (XAI) aims to shed light on the opaque behavior of machine learning algorithms, which includes assessing the importance of features for a predictive algorithm. Model-agnostic post hoc methods attribute scores to input features according to their relevance for the prediction in an arbitrary, already fitted supervised machine learning model (Molnar, 2020; Murdoch et al., 2019). Refined conceptualizations include, for example, methods that explain the predictions of individual observations, such as Shapley additive explanations (Lundberg and Lee, 2017), and methods that focus on the model's overall behavior, yielding global-level explanations. A crucial distinction in feature importance concepts is between conditional and marginal viewpoints (Strobl et al., 2008; Watson and Wright, 2021): marginal feature importance evaluates a feature's impact irrespective of the other features included in the model, whereas conditional feature importance takes the predictive information of the other features into account. The presence of dependency structures, which real-world datasets frequently exhibit, plays a pivotal role in this distinction, because assessing a feature's impact on the prediction given, i.e., on top of, the predictive information provided by correlated features alters the importance score attributed to it (Watson and Wright, 2021).
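To make the marginal/conditional distinction concrete, here is a minimal Python sketch contrasting unconditional permutation importance with a crude conditional variant that permutes a feature within quantile bins of a correlated feature. This is an illustration only, not the paper's method: the paper samples from the conditional distribution with adversarial random forests, whereas the binning scheme, the synthetic data, and all names below are hypothetical stand-ins.

```python
# Sketch: marginal vs. (naively) conditional permutation importance.
# x2 merely proxies x1, so it looks important marginally but adds
# little predictive information conditional on x1.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)       # x2 strongly correlated with x1
y = x1 + rng.normal(scale=0.5, size=n)   # only x1 drives the outcome
X = np.column_stack([x1, x2])

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
base = mean_squared_error(y, model.predict(X))

def marginal_importance(j):
    """Permute feature j unconditionally; report the loss increase."""
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    return mean_squared_error(y, model.predict(Xp)) - base

def conditional_importance(j, cond, bins=20):
    """Permute feature j within quantile bins of a correlated feature,
    roughly preserving the dependency structure (a crude stand-in for
    sampling from the true conditional distribution)."""
    Xp = X.copy()
    edges = np.quantile(X[:, cond], np.linspace(0, 1, bins)[1:-1])
    groups = np.digitize(X[:, cond], edges)
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        Xp[idx, j] = rng.permutation(Xp[idx, j])
    return mean_squared_error(y, model.predict(Xp)) - base

# Expect a large marginal score for x2 but a near-zero conditional one.
print(marginal_importance(1), conditional_importance(1, cond=0))
```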
A Large-Scale Neutral Comparison Study of Survival Models on Low-Dimensional Data
Burk, Lukas, Zobolas, John, Bischl, Bernd, Bender, Andreas, Wright, Marvin N., Sonabend, Raphael
This work presents the first large-scale neutral benchmark experiment focused on single-event, right-censored, low-dimensional survival data. Benchmark experiments are essential in methodological research to scientifically compare new and existing model classes through proper empirical evaluation. Existing benchmarks in the survival literature are often narrow in scope, focusing, for example, on high-dimensional data. Additionally, they may lack appropriate tuning or evaluation procedures, or are qualitative reviews rather than quantitative comparisons. This comprehensive study aims to fill that gap by neutrally evaluating a broad range of methods and providing generalizable conclusions. We benchmark 18 models, ranging from classical statistical approaches to many common machine learning methods, on 32 publicly available datasets. Models are tuned for both a discrimination measure and a proper scoring rule to assess performance in different settings. Using 8 survival metrics, we evaluate the discrimination, calibration, and overall predictive performance of the tested models. Under discrimination measures, we find that no method significantly outperforms the Cox model. However, (tuned) Accelerated Failure Time models achieved significantly better results with respect to overall predictive performance as measured by the right-censored log-likelihood. Machine learning methods that performed comparably well include Oblique Random Survival Forests under discrimination and Cox-based likelihood boosting under overall predictive performance. We conclude that for predictive purposes in the standard survival analysis setting of low-dimensional, right-censored data, the Cox Proportional Hazards model remains a simple and robust method, sufficient for practitioners.
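For readers who want to try this kind of evaluation on a small scale, the following Python sketch fits a Cox model and scores it with both a discrimination measure (Harrell's C-index) and a proper scoring rule (integrated Brier score) using scikit-survival. It is not the study's benchmark pipeline; the dataset, train/test split, and time grid are illustrative choices only.

```python
# Sketch: scoring a Cox model on discrimination and a proper scoring rule,
# using scikit-survival and its bundled WHAS500 dataset.
import numpy as np
from sklearn.model_selection import train_test_split
from sksurv.column import encode_categorical
from sksurv.datasets import load_whas500
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.metrics import concordance_index_censored, integrated_brier_score

X, y = load_whas500()
X = encode_categorical(X)  # one-hot encode the categorical covariates
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

cox = CoxPHSurvivalAnalysis().fit(X_tr, y_tr)

# Discrimination: concordance between predicted risk and observed outcomes.
cindex = concordance_index_censored(
    y_te["fustat"], y_te["lenfol"], cox.predict(X_te)
)[0]

# Overall predictive performance: integrated Brier score over a time grid
# inside the test set's follow-up range.
times = np.percentile(y_te["lenfol"], np.linspace(10, 80, 20))
surv = np.vstack([fn(times) for fn in cox.predict_survival_function(X_te)])
ibs = integrated_brier_score(y_tr, y_te, surv, times)

print(f"C-index: {cindex:.3f}, integrated Brier score: {ibs:.3f}")
```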