Performance Analysis
Ensemble Methods for Survival Data with Time-Varying Covariates
Yao, Weichi, Frydman, Halina, Larocque, Denis, Simonoff, Jeffrey S.
Survival data with time-varying covariates are common in practice. However, the traditional survival forests - conditional inference forest, relative risk forest and random survival forest - have accommodated only time-invariant covariates. Similarly, the recently proposed transformation forest, which incorporates split statistics suitable for non-proportional hazard settings, has employed only time-invariant covariates. We generalize the conditional inference and relative risk forests to allow time-varying covariates. We compare their performance with that of the Cox model and transformation forest, adapted to accommodate time-varying covariates, through a comprehensive simulation study in which the Kaplan-Meier estimate serves as a benchmark. In general, the performance of the two proposed forests substantially improves over the Kaplan-Meier estimate when the estimation conditions become more favorable. Taking into account all other factors, under the proportional hazards (PH) setting, the best method is always one of the two proposed forests, while under the non-PH setting, it is the adapted transformation forest. K-fold cross-validation can be an effective tool to choose between the methods in practice. Finally, the performance of the proposed forest methods for time-invariant covariate data is broadly similar to that found for time-varying covariate data. We also propose a general framework for estimation of a survival function in the presence of time-varying covariates, which can be applied to any method that uses the counting process (pseudo-subject) approach to handling time-varying covariates. This novel estimate of a single survival function takes the multiple survival estimation outputs, one for each pseudo-subject, and combines them in a theoretically justified way to form a proper monotone-decreasing survival function estimate.
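The abstract does not spell out how the per-pseudo-subject outputs are combined. One natural reading is to chain conditional survival probabilities across a subject's consecutive intervals. The sketch below assumes exactly that; the function name `combine_survival`, the tuple layout, and the exponential curves in the example are all hypothetical and are not taken from the paper.

```python
import math
from typing import Callable, List, Tuple

# One pseudo-subject: (interval start, interval stop, estimated survival curve).
# The curve is whatever a fitted model returns for that pseudo-subject's covariates.
PseudoSubject = Tuple[float, float, Callable[[float], float]]


def combine_survival(pseudo_subjects: List[PseudoSubject], t: float) -> float:
    """Chain conditional survival over intervals sorted by start time.

    On each interval (start, stop] the contribution is S_j(u) / S_j(start),
    the estimated probability of surviving to u given survival to start.
    Assumption: the intervals are contiguous and sorted. For t beyond the
    last stop time, the value at the last stop time is returned.
    """
    surv = 1.0
    for start, stop, s_hat in pseudo_subjects:
        if t <= start:
            break
        upper = min(t, stop)
        base = s_hat(start)
        if base <= 0.0:
            return 0.0
        surv *= s_hat(upper) / base
    return surv


if __name__ == "__main__":
    # Hypothetical subject whose covariate changes at time 2.
    subject = [
        (0.0, 2.0, lambda u: math.exp(-0.1 * u)),
        (2.0, 5.0, lambda u: math.exp(-0.5 * u)),
    ]
    for time in (1.0, 2.0, 3.0, 5.0):
        print(time, combine_survival(subject, time))
```

Each factor is a ratio of a non-increasing curve to its value at the interval start, so every factor lies in [0, 1] and the product cannot increase with t. That is the monotonicity property the abstract asks of the combined estimate.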
Fair Classification with Group-Dependent Label Noise
Wang, Jialu, Liu, Yang, Levy, Caleb
This work examines how to train fair classifiers in settings where training labels are corrupted with random noise, and where the error rates of corruption depend both on the label class and on the membership function for a protected subgroup. Heterogeneous label noise models systematic biases towards particular groups when generating annotations. We begin by presenting analytical results which show that naively imposing parity constraints on demographic disparity measures, without accounting for heterogeneous and group-dependent error rates, can decrease both the accuracy and the fairness of the resulting classifier. Our experiments demonstrate these issues arise in practice as well. We address these problems by performing empirical risk minimization with carefully defined surrogate loss functions and surrogate constraints that help avoid the pitfalls introduced by heterogeneous label noise. We provide both theoretical and empirical justifications for the efficacy of our methods. We view our results as an important example of how imposing fairness on biased data sets without proper care can do at least as much harm as it does good.
The Beginners' Guide to the ROC Curve and AUC
In the previous article, you learned about classification evaluation metrics such as Accuracy, Precision, Recall, and F1-Score. In this article, we will go through another important evaluation metric, the AUC-ROC score. The ROC curve (Receiver Operating Characteristic curve) is a graph showing the performance of a classification model at different probability thresholds. The ROC graph is created by plotting TPR against FPR, where FPR (False Positive Rate) is plotted on the x-axis and TPR (True Positive Rate) is plotted on the y-axis, for probability threshold values ranging from 0.0 to 1.0.
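To make the construction concrete, here is a minimal sketch in plain Python that sweeps the threshold over the distinct predicted scores, records one (FPR, TPR) point per threshold, and integrates the curve with the trapezoidal rule. The function names and the four-sample data set are illustrative only; a library routine would normally be used in practice.

```python
from typing import List, Tuple


def roc_points(y_true: List[int], scores: List[float]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) pairs, from the strictest threshold to the loosest."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    probs = [0.1, 0.4, 0.35, 0.8]
    curve = roc_points(labels, probs)
    print(curve)       # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
    print(auc(curve))  # 0.75
```

An AUC of 0.5 corresponds to the diagonal (random ranking) and 1.0 to a perfect ranking of positives above negatives.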
Dealing with Imbalanced Data in Machine Learning - KDnuggets
As an ML engineer or data scientist, sometimes you inevitably find yourself in a situation where you have hundreds of records for one class label and thousands of records for another class label. Upon training your model you obtain an accuracy above 90%. You then realize that the model is predicting everything as if it's in the class with the majority of records. Excellent examples of this are fraud detection problems and churn prediction problems, where the majority of the records are in the negative class. What do you do in such a scenario?
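The scenario above is easy to reproduce. The sketch below uses made-up class counts (100 positives, 1,900 negatives) and a "model" that always predicts the majority class; it shows accuracy looking healthy while recall on the minority class is zero.

```python
from typing import List


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of predictions that match the labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def recall(y_true: List[int], y_pred: List[int], positive: int = 1) -> float:
    """Fraction of actual positives that were predicted positive."""
    actual = sum(1 for t in y_true if t == positive)
    if actual == 0:
        return 0.0
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    return hits / actual


if __name__ == "__main__":
    # Illustrative counts only: 100 fraud cases among 2,000 records.
    y_true = [1] * 100 + [0] * 1900
    y_pred = [0] * len(y_true)  # always predict the majority class
    print("accuracy:", accuracy(y_true, y_pred))  # 0.95
    print("recall:", recall(y_true, y_pred))      # 0.0
```

This is why metrics that look at the minority class directly (recall, precision, F1, AUC) are usually reported alongside accuracy on imbalanced data.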
Financial Data Analysis Using Expert Bayesian Framework For Bankruptcy Prediction
Mukeri, Amir, Shaikh, Habibullah, Gaikwad, D. P.
In recent years, bankruptcy forecasting has gained a lot of attention from researchers as well as practitioners in the field of financial risk management. For bankruptcy prediction, the various approaches proposed in the past and currently in practice rely on accounting ratios combined with statistical modeling or machine learning methods. These models have had varying degrees of success. Models such as Linear Discriminant Analysis or Artificial Neural Networks employ discriminative classification techniques. They lack an explicit provision for including prior expert knowledge. In this paper, we propose another route, generative modeling using an Expert Bayesian framework. The biggest advantage of the proposed framework is the explicit inclusion of expert judgment in the modeling process. The proposed methodology also provides a way to quantify uncertainty in prediction. As a result, the model built using the Bayesian framework is highly flexible, interpretable and intuitive in nature. The proposed approach is well suited for highly regulated or safety-critical applications such as finance or medical diagnosis. In such cases, accuracy of the prediction is not the only concern for decision makers. Decision makers and other stakeholders are also interested in the uncertainty of the prediction as well as the interpretability of the model. We empirically demonstrate these benefits of the proposed framework on a real-world dataset using Stan, a probabilistic programming language. We found that the proposed model is either comparable or superior to the other existing methods. The resulting model also has a much lower False Positive Rate than many existing state-of-the-art methods. The corresponding R code for the experiments is available in a GitHub repository.
Dataset Meta-Learning from Kernel Ridge-Regression
Nguyen, Timothy, Chen, Zhourong, Lee, Jaehoon
One of the most fundamental aspects of any machine learning algorithm is the training data used by the algorithm. We introduce the novel concept of ε-approximation of datasets, obtaining datasets which are much smaller than or are significant corruptions of the original training data while maintaining similar model performance. We introduce a meta-learning algorithm called Kernel Inducing Points (KIP) for obtaining such remarkable datasets, inspired by the recent developments in the correspondence between infinitely-wide neural networks and kernel ridge-regression (KRR). For KRR tasks, we demonstrate that KIP can compress datasets by one or two orders of magnitude, significantly improving previous dataset distillation and subset selection methods while obtaining state of the art results for MNIST and CIFAR-10 classification. Furthermore, our KIP-learned datasets are transferable to the training of finite-width neural networks even beyond the lazy-training regime, which leads to state of the art results for neural network dataset distillation with potential applications to privacy-preservation. Datasets are a pivotal component in any machine learning task. Typically, a machine learning problem regards a dataset as given and uses it to train a model according to some specific objective. In this work, we depart from the traditional paradigm by instead optimizing a dataset with respect to a learning objective, from which the resulting dataset can be used in a range of downstream learning tasks. Our work is directly motivated by several challenges in existing learning methods. Kernel methods or instance-based learning (Vinyals et al., 2016; Snell et al., 2017; Kaya & Bilge, 2019) in general require a support dataset to be deployed at inference time. Achieving good prediction accuracy typically requires having a large support set, which inevitably increases both memory footprint and latency at inference time--the scalability issue.
It can also raise privacy concerns when deploying a support set of original examples, e.g., distributing raw images to user devices. Additional challenges to scalability include, for instance, the desire for rapid hyper-parameter search (Shleifer & Prokop, 2019) and minimizing the resources consumed when replaying data for continual learning (Borsos et al., 2020). A valuable contribution to all these problems would be to find surrogate datasets that can mitigate the challenges which occur for naturally occurring datasets without a significant sacrifice in performance.
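To illustrate why the support set matters in kernel ridge-regression, the following NumPy sketch predicts target labels from a small support set and scores how well that support set explains held-out targets. It is a minimal sketch under stated assumptions, not the authors' implementation: the RBF kernel is a stand-in for the neural-network-derived kernels the abstract alludes to, and `support_set_loss` is a hypothetical name for the kind of objective one might minimize over the support data.

```python
import numpy as np


def rbf_kernel(a: np.ndarray, b: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """Gaussian kernel matrix between rows of a and rows of b (assumed kernel)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def krr_predict(x_support, y_support, x_target, reg=1e-6, gamma=1.0):
    """Kernel ridge-regression prediction: K_ts (K_ss + reg * I)^{-1} y_s."""
    k_ss = rbf_kernel(x_support, x_support, gamma)
    k_ts = rbf_kernel(x_target, x_support, gamma)
    alpha = np.linalg.solve(k_ss + reg * np.eye(len(x_support)), y_support)
    return k_ts @ alpha


def support_set_loss(x_support, y_support, x_target, y_target, reg=1e-6, gamma=1.0):
    """Half mean squared error of the support set's predictions on the targets."""
    residual = y_target - krr_predict(x_support, y_support, x_target, reg, gamma)
    return 0.5 * float(np.mean(residual**2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_full = rng.uniform(-2.0, 2.0, size=(200, 1))
    y_full = np.sin(2.0 * x_full[:, 0])
    # A 10-point support set standing in for a compressed dataset.
    x_small = np.linspace(-2.0, 2.0, 10)[:, None]
    y_small = np.sin(2.0 * x_small[:, 0])
    print(support_set_loss(x_small, y_small, x_full, y_full))
```

Both the memory cost (storing the support set) and the inference cost (one kernel evaluation per support point) scale with the support set's size, which is the scalability issue the text describes.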
View selection in multi-view stacking: Choosing the meta-learner
van Loon, Wouter, Fokkema, Marjolein, Szabo, Botond, de Rooij, Mark
Multi-view stacking is a framework for combining information from different views (i.e. different feature sets) describing the same set of objects. In this framework, a base-learner algorithm is trained on each view separately, and their predictions are then combined by a meta-learner algorithm. In a previous study, stacked penalized logistic regression, a special case of multi-view stacking, has been shown to be useful in identifying which views are most important for prediction. In this article we expand this research by considering seven different algorithms to use as the meta-learner, and evaluating their view selection and classification performance in simulations and two applications on real gene-expression data sets. Our results suggest that if both view selection and classification accuracy are important to the research at hand, then the nonnegative lasso, nonnegative adaptive lasso and nonnegative elastic net are suitable meta-learners. Exactly which among these three is to be preferred depends on the research context. The remaining four meta-learners, namely nonnegative ridge regression, nonnegative forward selection, stability selection and the interpolating predictor, show few advantages that would make them preferable to the other three.
Information-theoretic Feature Selection via Tensor Decomposition and Submodularity
Amiridi, Magda, Kargas, Nikos, Sidiropoulos, Nicholas D.
Feature selection by maximizing high-order mutual information between the selected feature vector and a target variable is the gold standard in terms of selecting the best subset of relevant features that maximizes the performance of prediction models. However, such an approach typically requires knowledge of the multivariate probability distribution of all features and the target, and involves a challenging combinatorial optimization problem. Recent work has shown that any joint Probability Mass Function (PMF) can be represented as a naive Bayes model, via Canonical Polyadic (tensor rank) Decomposition. In this paper, we introduce a low-rank tensor model of the joint PMF of all variables and indirect targeting as a way of mitigating complexity and maximizing the classification performance for a given number of features. Through low-rank modeling of the joint PMF, it is possible to circumvent the curse of dimensionality by learning principal components of the joint distribution. By indirectly aiming to predict the latent variable of the naive Bayes model instead of the original target variable, it is possible to formulate the feature selection problem as maximization of a monotone submodular function subject to a cardinality constraint - which can be tackled using a greedy algorithm that comes with performance guarantees. Numerical experiments with several standard datasets suggest that the proposed approach compares favorably to the state of the art for this important problem.
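The greedy algorithm referred to here is the classical one for maximizing a monotone submodular set function under a cardinality constraint, which is known to achieve at least a (1 - 1/e) fraction of the optimal value. The sketch below is generic: it uses a toy coverage objective rather than the paper's mutual-information objective, and the names `greedy_max` and `coverage` are illustrative.

```python
from typing import Callable, Dict, FrozenSet, Hashable, Iterable, List, Set


def greedy_max(
    candidates: Iterable[Hashable],
    value: Callable[[FrozenSet[Hashable]], float],
    k: int,
) -> List[Hashable]:
    """Pick up to k items, each time adding the one with the largest marginal gain."""
    selected: List[Hashable] = []
    remaining = list(candidates)
    current = value(frozenset())
    for _ in range(k):
        best_gain, best_item = 0.0, None
        for item in remaining:
            gain = value(frozenset(selected + [item])) - current
            if gain > best_gain:  # ties keep the earlier candidate
                best_gain, best_item = gain, item
        if best_item is None:  # no candidate improves the objective
            break
        selected.append(best_item)
        remaining.remove(best_item)
        current += best_gain
    return selected


if __name__ == "__main__":
    # Toy objective: number of distinct elements covered (monotone and submodular).
    sets: Dict[str, Set[int]] = {
        "a": {1, 2, 3},
        "b": {3, 4},
        "c": {4, 5, 6, 7},
        "d": {1, 2},
    }

    def coverage(chosen: FrozenSet[str]) -> float:
        covered: Set[int] = set()
        for name in chosen:
            covered |= sets[name]
        return float(len(covered))

    print(greedy_max(sets, coverage, k=2))  # ['c', 'a']
```

In a feature selection setting, `candidates` would be the feature indices and `value` the submodular score of a feature subset; the loop itself is unchanged.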
Failures of model-dependent generalization bounds for least-norm interpolation
Bartlett, Peter L., Long, Philip M.
Deep learning methodology has revealed some striking deficiencies of classical statistical learning theory: large neural networks, trained to zero empirical risk on noisy training data, have good predictive accuracy on independent test data. These methods are overfitting (that is, fitting to the training data better than the noise should allow), but the overfitting is benign (that is, prediction performance is good). It is an important open problem to understand why this is possible. The presence of noise is key to why the success of interpolating algorithms is mysterious. Generalization of algorithms that produce a perfect fit in the absence of noise has been studied for decades (see [Haussler, 1992] and its references). A number of recent papers have provided generalization bounds for interpolating algorithms in the absence of noise, either for deep networks or in abstract frameworks motivated by deep networks [Li and Liang, 2018, Arora et al., 2019, Cao and Gu, 2019, Feldman, 2020]. The generalization bounds in these papers either do not hold or become vacuous in the presence of noise: Assumption A1 in [Li and Liang, 2018] rules out noisy data; the data-dependent bound in Arora et al. [2019, Theorem 5.1] becomes vacuous when independent noise is added to the y
Machine-Learning the Sato--Tate Conjecture
He, Yang-Hui, Lee, Kyu-Hwan, Oliver, Thomas
We apply some of the latest techniques from machine-learning to the arithmetic of hyperelliptic curves. More precisely, we show that, with impressive accuracy and confidence (between 99 and 100 percent precision), and in very short time (a matter of seconds on an ordinary laptop), a Bayesian classifier can distinguish between Sato-Tate groups given a small number of Euler factors for the L-function. Our observations are in keeping with the Sato-Tate conjecture for curves of low genus. For elliptic curves, this amounts to distinguishing generic curves (with Sato-Tate group SU(2)) from those with complex multiplication. In genus 2, a principal component analysis is observed to separate the generic Sato-Tate group USp(4) from the non-generic groups. Furthermore, in this case, for which there are many more non-generic possibilities than in the case of elliptic curves, we demonstrate an accurate characterisation of several Sato-Tate groups with the same identity component. Throughout, our observations are verified using known results from the literature and the data available in the LMFDB. The results in this paper suggest that a machine can be trained to learn the Sato-Tate distributions and may be able to classify curves much more efficiently than the methods available in the literature.