 Ensemble Learning


WiFi Based Distance Estimation Using Supervised Machine Learning

arXiv.org Artificial Intelligence

In recent years, WiFi has become the primary source of information for locating a person or device indoors. Collecting RSSI values as reference measurements with known positions, known as WiFi fingerprinting, is commonly used in various positioning methods and algorithms that appear in the literature. However, measuring the spatial distance between a given set of WiFi fingerprints is heavily affected by the selection of the signal distance function used to model signal space as geospatial distance. In this study, the authors proposed the utilization of machine learning to improve the estimation of geospatial distance between fingerprints. This research examined data collected from 13 different open datasets to provide a broad representation, aiming for a general model that can be used in any indoor environment. The proposed novel approach extracted data features by examining a set of commonly used signal distance metrics via a feature selection process that includes feature analysis and a genetic algorithm. To demonstrate that the output of this research is venue independent, all models were tested on datasets previously excluded during the training and validation phase. Finally, various machine learning algorithms were compared using a wide variety of evaluation metrics, including the ability to scale out the test bed to real-world unsolicited datasets.


Model Optimization in Imbalanced Regression

arXiv.org Artificial Intelligence

Imbalanced domain learning aims to produce accurate models in predicting instances that, though underrepresented, are of utmost importance for the domain. Research in this field has been mainly focused on classification tasks. Comparatively, the number of studies carried out in the context of regression tasks is negligible. One of the main reasons for this is the lack of loss functions capable of focusing on minimizing the errors of extreme (rare) values. Recently, an evaluation metric was introduced: Squared Error Relevance Area (SERA). This metric places a greater emphasis on the errors committed at extreme values while also accounting for the performance in the overall target variable domain, thus preventing severe bias. However, its effectiveness as an optimization metric is unknown. In this paper, our goal is to study the impacts of using SERA as an optimization criterion in imbalanced regression tasks. Using gradient boosting algorithms as proof of concept, we perform an experimental study with 36 data sets of different domains and sizes. Results show that models that used SERA as an objective function are practically better than the models produced by their respective standard boosting algorithms at the prediction of extreme values. This confirms that SERA can be embedded as a loss function into optimization-based learning algorithms for imbalanced regression scenarios.
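As a rough illustration of the metric named in this abstract: SERA is usually described as the area, over thresholds t in [0, 1], under the curve of squared errors summed over the instances whose relevance is at least t. The sketch below approximates that area with the trapezoidal rule. It assumes that definition; the `relevance` function, the `steps` parameter, and the toy data are hypothetical and are not taken from the paper.

```python
def sera(y_true, y_pred, relevance, steps=1000):
    """Approximate SERA with the trapezoidal rule over thresholds in [0, 1].

    `relevance` maps a target value to a score in [0, 1]; it is an
    assumption of this sketch, supplied by the caller.
    """
    phi = [relevance(y) for y in y_true]
    sq_err = [(p - y) ** 2 for y, p in zip(y_true, y_pred)]

    def ser(t):
        # Squared error summed over instances with relevance >= t.
        return sum(e for e, r in zip(sq_err, phi) if r >= t)

    h = 1.0 / steps
    total = 0.5 * (ser(0.0) + ser(1.0))
    total += sum(ser(k * h) for k in range(1, steps))
    return total * h


# Toy data: the value 10 is treated as the only "rare, relevant" target.
y_true = [0.0, 0.0, 10.0]
y_pred = [1.0, 1.0, 8.0]
rare_only = lambda y: 1.0 if y >= 10.0 else 0.0
print(sera(y_true, y_pred, rare_only))
```

With a relevance function that is 1 everywhere, this quantity reduces to the plain sum of squared errors; with the step-shaped relevance above, the error on the rare value dominates the result.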


Tuning XGBoost Hyperparameters - KDnuggets

#artificialintelligence

To recap, XGBoost stands for Extreme Gradient Boosting and is a supervised learning algorithm that falls under the gradient-boosted decision tree (GBDT) family of machine learning algorithms. These algorithms make their predictions by combining a set of weaker models, namely decision trees that evaluate the input through if-then-else true/false feature questions. The trees are created in sequential form to assess and estimate the probability of producing a correct decision. Before we get into the tuning of XGBoost hyperparameters, let's understand why tuning is important. Hyperparameter tuning is a vital part of improving the overall behavior and performance of a machine learning model. A hyperparameter is a type of parameter that is set before the learning process and sits outside of the model.
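To make the idea of choosing hyperparameters before training more concrete, here is a minimal sketch of an exhaustive grid search in plain Python. `max_depth` and `learning_rate` are real XGBoost hyperparameter names, but `grid_search` and `toy_score` are hypothetical helpers: `toy_score` merely stands in for "train a model with these settings and measure validation performance" and is not from the article.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and keep the best-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)  # higher is better
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Hypothetical stand-in for a validation score; it peaks at
    # max_depth=4 and learning_rate=0.1 purely for demonstration.
    return -(abs(params["max_depth"] - 4) + abs(params["learning_rate"] - 0.1))


param_grid = {"max_depth": [2, 4, 6], "learning_rate": [0.01, 0.1, 0.3]}
best_params, best_score = grid_search(param_grid, toy_score)
print(best_params, best_score)
```

In practice the scoring function would fit a model on training data and evaluate it on held-out data, which is what makes exhaustive search expensive as the grid grows.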


The Magic of XGBoost

#artificialintelligence

XGBoost stands for eXtreme Gradient Boosted trees. It is an ensemble machine learning method. In this type of learning, the base weak learners, working in a chained sequence, learn from each other's mistakes and try to achieve good results with small improvements. Ensemble learning algorithms are super powerful and XGBoost is the superstar. It is widely used because of its ease of use, speed, and achievement. It works for both regression and classification problems and there's a really good chance that it is going to prove to be the best model to fit your data.


Succinct Differentiation of Disparate Boosting Ensemble Learning Methods for Prognostication of Polycystic Ovary Syndrome Diagnosis

arXiv.org Artificial Intelligence

Polycystic ovary syndrome (PCOS) is the most common gynecological disorder affecting women globally. The symptoms of PCOS include irregular periods, hirsutism, thinning hair and hair loss on the head, oily skin or acne, and weight gain. PCOS can raise the risk in later life of type 2 diabetes, a lifelong condition that affects a person's blood sugar levels. It is also associated with high blood pressure and high cholesterol, which can lead to heart disease and stroke, and overweight women may develop sleep apnoea, a condition that causes interrupted breathing during sleep. Around 10-15% of women of reproductive age (15 to 49 years) suffer from PCOS. The monetary cost of this disease and its comorbidities calls for the development of instruments and techniques that permit early and precise identification. To address this problem, this paper proposes a system for the early detection and prediction of PCOS from the most reliable, minimal, and promising scientific and metabolic parameters. Machine learning [Shinde and Shah, 2018] can be leveraged to perform prognostication of PCOS by extracting factual records from the given statistics, considering that machine learning is sometimes described as glorified statistics. Ensemble learning is a meta-approach to machine learning that seeks to improve overall performance by combining the predictions from more than one model.


Forecasting COVID-19 spreading trough an ensemble of classical and machine learning models: Spain's case study

arXiv.org Artificial Intelligence

In this work we evaluate the applicability of an ensemble of population models and machine learning models to predict the near future evolution of the COVID-19 pandemic, with a particular use case in Spain. We rely solely on open and public datasets, fusing incidence, vaccination, human mobility and weather data to feed our machine learning models (Random Forest, Gradient Boosting, k-Nearest Neighbours and Kernel Ridge Regression). We use the incidence data to adjust classic population models (Gompertz, Logistic, Richards, Bertalanffy) in order to be able to better capture the trend of the data. We then ensemble these two families of models in order to obtain a more robust and accurate prediction. Furthermore, we have observed an improvement in the predictions obtained with machine learning models as we add new features (vaccines, mobility, climatic conditions), analyzing the importance of each of them using Shapley Additive Explanation values. As in any other modelling work, the quality of the data and predictions has several limitations, and therefore they must be seen from a critical standpoint, as we discuss in the text. Our work concludes that the ensemble use of these models improves the individual predictions (using only machine learning models or only population models) and can be applied, with caution, in cases when compartmental models cannot be utilized due to the lack of relevant data.


2060: Civilization, Energy, and Progression of Mankind on the Kardashev Scale

arXiv.org Artificial Intelligence

Energy has been propelling the development of human civilization for millennia, and technologies acquiring energy beyond human and animal power have been continuously advanced and transformed. In 1964, the Kardashev Scale was proposed to quantify the relationship between energy consumption and the development of civilizations. Human civilization presently stands at Type 0.7276 on this scale. Projecting the future energy consumption, estimating the change of its constituting structure, and evaluating the influence of possible technological revolutions are critical in the context of civilization development. In this study, we use two machine learning models, random forest (RF) and autoregressive integrated moving average (ARIMA), to simulate and predict energy consumption on a global scale. We further project the position of human civilization on the Kardashev Scale in 2060. The result shows that the global energy consumption is expected to reach 928-940 EJ in 2060, with a total growth of over 50% in the coming 40 years, and our civilization is expected to achieve Type 0.7474 on the Kardashev Scale, still far away from a Type 1 civilization. Additionally, we discuss the potential energy segmentation change before 2060 and present the influence of the advent of nuclear fusion in this context.
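The abstract does not state how the fractional Kardashev type is computed. A minimal sketch, assuming Carl Sagan's interpolation formula K = (log10 P - 6) / 10 with P in watts, which appears consistent with the figures quoted above. The function names and the choice of a 365.25-day year are assumptions of this sketch.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed year length


def kardashev_type(power_watts):
    """Sagan's interpolation: Type 1 corresponds to 1e16 W."""
    return (math.log10(power_watts) - 6.0) / 10.0


def kardashev_from_annual_ej(energy_ej_per_year):
    """Convert annual energy use in exajoules to a Kardashev type."""
    power_watts = energy_ej_per_year * 1e18 / SECONDS_PER_YEAR
    return kardashev_type(power_watts)


# Upper end of the 2060 projection quoted in the abstract.
print(round(kardashev_from_annual_ej(940), 4))
```

Because the scale is logarithmic, a 50% rise in energy consumption moves the type by less than 0.02, which is why the projected value remains far from Type 1.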


A Novel Ontology-guided Attribute Partitioning Ensemble Learning Model for Early Prediction of Cognitive Deficits using Quantitative Structural MRI in Very Preterm Infants

arXiv.org Artificial Intelligence

Structural magnetic resonance imaging studies have shown that brain anatomical abnormalities are associated with cognitive deficits in preterm infants. Brain maturation and geometric features can be used with machine learning models for predicting later neurodevelopmental deficits. However, traditional machine learning models would suffer from a large feature-to-instance ratio (i.e., a large number of features but a small number of instances/samples). Ensemble learning is a paradigm that strategically generates and integrates a library of machine learning classifiers and has been successfully used on a wide variety of predictive modeling problems to boost model performance. The attribute (i.e., feature) bagging method is the most commonly used feature partitioning scheme, which randomly and repeatedly draws feature subsets from the entire feature set. Although the attribute bagging method can effectively reduce feature dimensionality to handle the large feature-to-instance ratio, it lacks consideration of domain knowledge and latent relationships among features. In this study, we proposed a novel Ontology-guided Attribute Partitioning (OAP) method to better draw feature subsets by considering the domain-specific relationships among features. With the better partitioned feature subsets, we developed an ensemble learning framework, which is referred to as OAP-Ensemble Learning (OAP-EL). We applied the OAP-EL to predict cognitive deficits at 2 years of age using quantitative brain maturation and geometric features obtained at term equivalent age in very preterm infants. We demonstrated that the proposed OAP-EL approach significantly outperformed the peer ensemble learning and traditional machine learning approaches.
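The baseline that the abstract contrasts with OAP, attribute bagging, can be sketched in a few lines: feature subsets are drawn at random, repeatedly, from the full feature set, and one base classifier would then be trained per subset. The function name, the feature names, and the subset sizes below are hypothetical, and the ontology-guided partitioning proposed in the paper is not reproduced here.

```python
import random


def attribute_bagging(features, n_subsets, subset_size, seed=0):
    """Draw n_subsets random feature subsets, each without replacement.

    The same feature may appear in several subsets, but never twice
    within one subset.
    """
    rng = random.Random(seed)
    return [rng.sample(features, subset_size) for _ in range(n_subsets)]


# Hypothetical feature names, for illustration only.
features = ["f1", "f2", "f3", "f4", "f5", "f6"]
for subset in attribute_bagging(features, n_subsets=4, subset_size=3, seed=42):
    print(subset)
```

Each subset reduces the dimensionality seen by one base learner, but the draw ignores any relationship among features, which is the limitation the abstract points out.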


Introduction to Adaptive Boosting Classifier

#artificialintelligence

Adaptive Boosting Classifier is an ensemble classifier developed by Yoav Freund and Robert Schapire. This algorithm works by creating a prediction model in the form of a set of weak models. It requires specifying a set of weak learners before actually starting it. The weight of each model is determined based on whether it correctly predicted the sample or not. In a situation where the learner has predicted wrong, its weight is slightly reduced. The whole process is carried out until convergence[1].
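The description above is informal. A minimal sketch of the standard discrete AdaBoost formulation, in which misclassified samples gain weight for the next round and each weak learner receives a vote of 0.5 * ln((1 - err) / err), might look like the following. The one-dimensional decision stumps, the function names, and the toy data are assumptions made for illustration and are not taken from the article.

```python
import math


def fit_stump(xs, ys, w):
    """Return (weighted error, threshold, polarity) of the best 1-D stump."""
    best = None
    for thr in sorted(set(xs)):
        for pol in (1, -1):
            preds = [pol if x >= thr else -pol for x in xs]
            err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best


def adaboost(xs, ys, rounds=3):
    """Train `rounds` stumps; labels in ys must be +1 or -1."""
    n = len(xs)
    w = [1.0 / n] * n  # start with uniform sample weights
    model = []
    for _ in range(rounds):
        err, thr, pol = fit_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote of this stump
        model.append((alpha, thr, pol))
        # Increase the weight of misclassified samples, decrease the rest.
        w = [wi * math.exp(-alpha * y * (pol if x >= thr else -pol))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def predict(model, x):
    """Weighted vote of all stumps."""
    score = sum(a * (pol if x >= thr else -pol) for a, thr, pol in model)
    return 1 if score >= 0 else -1


# Toy data that no single stump can separate.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
model = adaboost(xs, ys, rounds=3)
print([predict(model, x) for x in xs])
```

On this toy set no single threshold classifies every point, yet the weighted vote of three stumps does, which is the core idea of combining weak models.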


A Computational Exploration of Emerging Methods of Variable Importance Estimation

arXiv.org Artificial Intelligence

Estimating the importance of variables is an essential task in modern machine learning. This helps to evaluate the goodness of a feature in a given model. Several techniques for estimating the importance of variables have been developed during the last decade. In this paper, we proposed a computational and theoretical exploration of the emerging methods of variable importance estimation, namely: Least Absolute Shrinkage and Selection Operator (LASSO), Support Vector Machine (SVM), the Predictive Error Function (PERF), Random Forest (RF), and Extreme Gradient Boosting (XGBOOST), which were tested on different kinds of real-life and simulated data. All these methods can handle both regression and classification tasks seamlessly, but all fail when it comes to dealing with data containing missing values. The implementation has shown that PERF has the best performance in the case of highly correlated data, closely followed by RF. PERF and XGBOOST are "data-hungry" methods: they had the worst performance on small data sizes, but they are the fastest when it comes to execution time. SVM is the most appropriate when many redundant features are in the dataset. An advantage of PERF is its natural cut-off at zero, which helps to separate positive and negative scores, with all positive scores indicating essential and significant features while negative scores indicate useless features. RF and LASSO are very versatile in that they can be used in almost all situations, although they do not give the best results.