The combination of multiple classifiers using ensemble methods is increasingly important for making progress in a variety of difficult prediction problems. We present a comparative analysis of several ensemble methods through two case studies in genomics, namely the prediction of genetic interactions and protein functions, to demonstrate their efficacy on real-world datasets and draw useful conclusions about their behavior. These methods include simple aggregation, meta-learning, cluster-based meta-learning, and ensemble selection using heterogeneous classifiers trained on resampled data to improve the diversity of their predictions. We present a detailed analysis of these methods across 4 genomics datasets and find the best of these methods offer statistically significant improvements over the state of the art in their respective domains. In addition, we establish a novel connection between ensemble selection and meta-learning, demonstrating how both of these disparate methods establish a balance between ensemble diversity and performance.
A crucial point for ensemble learning systems is the capacity of making different errors on any given sample, which highlights the importance of diversity for ensemble-based decision systems. A usual way of increasing diversity is to combine traditional ensemble methods. Based on this context, we propose a novel combining-based algorithm of pool generation using a merging of bagging, random patches, and boosting techniques for ensemble regression problems. Numerical results indicate that, depending on both the dataset and the diversity measurement, our proposal generates a pool of regressors with more diversity when compared to single ensemble generator approaches.
Building a good predictive model requires an array of activities such as data imputation, feature transformations, estimator selection, hyper-parameter search and ensemble construction. Given the large, complex and heterogenous space of options, off-the-shelf optimization methods are infeasible for realistic response times. In practice, much of the predictive modeling process is conducted by experienced data scientists, who selectively make use of available tools. Over time, they develop an understanding of the behavior of operators, and perform serial decision making under uncertainty, colloquially referred to as educated guesswork. With an unprecedented demand for application of supervised machine learning, there is a call for solutions that automatically search for a good combination of parameters across these tasks to minimize the modeling error. We introduce a novel system called APRL (Autonomous Predictive modeler via Reinforcement Learning), that uses past experience through reinforcement learning to optimize such sequential decision making from within a set of diverse actions under a time constraint on a previously unseen predictive learning problem. APRL actions are taken to optimize the performance of a final ensemble. This is in contrast to other systems, which maximize individual model accuracy first and create ensembles as a disconnected post-processing step. As a result, APRL is able to reduce up to 71\% of classification error on average over a wide variety of problems.
Ensemble learning combines several individual models to obtain better generalization performance. Currently, deep learning models with multilayer processing architecture is showing better performance as compared to the shallow or traditional classification models. Deep ensemble learning models combine the advantages of both the deep learning models as well as the ensemble learning such that the final model has better generalization performance. This paper reviews the state-of-art deep ensemble models and hence serves as an extensive summary for the researchers. The ensemble models are broadly categorised into ensemble models like bagging, boosting and stacking, negative correlation based deep ensemble models, explicit/implicit ensembles, homogeneous /heterogeneous ensemble, decision fusion strategies, unsupervised, semi-supervised, reinforcement learning and online/incremental, multilabel based deep ensemble models. Application of deep ensemble models in different domains is also briefly discussed. Finally, we conclude this paper with some future recommendations and research directions.
Very few K-nearest-neighbor (KNN) ensembles exist, despite the efficacy of this approach in regression, classification, and outlier detection. Those that do exist focus on bagging features, rather than varying k or bagging observations; it is unknown whether varying k or bagging observations can improve prediction. Given recent studies from topological data analysis, varying k may function like multiscale topological methods, providing stability and better prediction, as well as increased ensemble diversity. This paper explores 7 KNN ensemble algorithms combining bagged features, bagged observations, and varied k to understand how each of these contribute to model fit. Specifically, these algorithms are tested on Tweedie regression problems through simulations and 6 real datasets; results are compared to state-of-the-art machine learning models including extreme learning machines, random forest, boosted regression, and Morse-Smale regression. Results on simulations suggest gains from varying k above and beyond bagging features or samples, as well as the robustness of KNN ensembles to the curse of dimensionality. KNN regression ensembles perform favorably against state-of-the-art algorithms and dramatically improve performance over KNN regression. Further, real dataset results suggest varying k is a good strategy in general (particularly for difficult Tweedie regression problems) and that KNN regression ensembles often outperform state-of-the-art methods. These results for k-varying ensembles echo recent theoretical results in topological data analysis, where multidimensional filter functions and multiscale coverings provide stability and performance gains over single-dimensional filters and single-scale covering. This opens up the possibility of leveraging multiscale neighborhoods and multiple measures of local geometry in ensemble methods.