 Performance Analysis


To tune or not to tune the number of trees in random forest?

arXiv.org Machine Learning

The number of trees T in the random forest (RF) algorithm for supervised learning has to be set by the user. It is controversial whether T should simply be set to the largest computationally manageable value or whether a smaller T may in some cases be better. While the principle underlying bagging is that "more trees are better", in practice the classification error rate sometimes reaches a minimum and then increases again as the number of trees grows. The goal of this paper is four-fold: (i) providing theoretical results showing that the expected error rate may be a non-monotonic function of the number of trees and explaining under which circumstances this happens; (ii) providing theoretical results showing that such non-monotonic patterns cannot be observed for other performance measures such as the Brier score and the logarithmic loss (for classification) and the mean squared error (for regression); (iii) illustrating the extent of the problem through an application to a large number (n = 306) of datasets from the public database OpenML; (iv) finally arguing in favor of setting T to a computationally feasible large number, depending on convergence properties of the desired performance measure.
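The contrast between points (i) and (ii) can be reproduced with a few lines of exact arithmetic. The sketch below is not taken from the paper. It assumes a binary problem in which each tree votes for the true class independently with a fixed per-observation probability `p_correct`, ties are broken at random, and the forest's predicted probability is the fraction of trees voting for the true class. The two-group population in `GROUPS` is a hypothetical choice made only for illustration.

```python
from math import comb

def expected_error(p_correct, n_trees):
    """Probability that a majority vote of n_trees trees is wrong (ties count half)."""
    err = 0.0
    for k in range(n_trees + 1):  # k = number of trees voting correctly
        prob = comb(n_trees, k) * p_correct**k * (1 - p_correct)**(n_trees - k)
        if 2 * k < n_trees:
            err += prob
        elif 2 * k == n_trees:
            err += 0.5 * prob
    return err

def expected_brier(p_correct, n_trees):
    """Expected squared distance between the vote fraction and the true label (1)."""
    # Squared bias of the vote fraction plus its variance.
    return (1 - p_correct) ** 2 + p_correct * (1 - p_correct) / n_trees

# Hypothetical population: half "easy" observations, half "hard" ones.
GROUPS = [(0.5, 0.9), (0.5, 0.4)]  # (share, p_correct)

def mean_error(n_trees):
    return sum(w * expected_error(p, n_trees) for w, p in GROUPS)

def mean_brier(n_trees):
    return sum(w * expected_brier(p, n_trees) for w, p in GROUPS)

if __name__ == "__main__":
    for t in (1, 3, 5, 11, 51, 101):
        print(t, round(mean_error(t), 4), round(mean_brier(t), 4))
```

Under these assumptions the mean error rate falls from 1 to 3 trees and rises afterwards, because the majority vote for the hard group converges to the wrong class, while the mean Brier score decreases with every added tree.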


The Best Metric to Measure Accuracy of Classification Models

@machinelearnbot

Unlike evaluating the accuracy of models that predict a continuous or discrete dependent variable, such as Linear Regression models, evaluating the accuracy of a classification model can be more complex and time-consuming. Before measuring the accuracy of a classification model, an analyst would first measure its robustness with the help of metrics such as AIC-BIC, AUC-ROC, AUC-PR, the Kolmogorov-Smirnov chart, etc. The next logical step is to measure its accuracy. To understand the complexity behind measuring accuracy, we need to know a few basic concepts. For example, a classification model like Logistic Regression will output a probability between 0 and 1 instead of the desired output of the actual target variable, like Yes/No.
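Because the model returns probabilities, a cut-off is needed before accuracy can be computed at all, and the resulting accuracy depends on that cut-off. The following is a minimal sketch of this step. The labels, the probabilities, and the helper names `to_labels` and `accuracy` are hypothetical and not taken from the article.

```python
def to_labels(probabilities, threshold=0.5):
    """Map predicted probabilities to 1 ("Yes") or 0 ("No") using a cut-off."""
    return [1 if p >= threshold else 0 for p in probabilities]

def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical ground truth and model output, for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.6, 0.45, 0.4, 0.3, 0.1]

for threshold in (0.25, 0.5, 0.75):
    print(threshold, accuracy(y_true, to_labels(probs, threshold)))
```

On this toy data the same probabilities yield different accuracy values at different cut-offs, which is one reason a single accuracy figure can be misleading without the threshold that produced it.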


Machine Learning: An In-Depth Guide - Model Evaluation, Validation, Complexity, and Improvement

#artificialintelligence

Welcome to the third article in a five-part series about machine learning. In this article, we'll continue our machine learning discussion and focus on problems associated with overfitting data, as well as controlling model complexity, an introduction to model evaluation and errors, model validation and tuning, and improving model performance. Overfitting is one of the greatest concerns in predictive analytics and machine learning. Overfitting refers to a situation where the model chosen to fit the training data fits too well, and essentially captures all of the noise, outliers, and so on. The consequence of this is that the model will fit the training data very well, but will not accurately predict cases not represented by the training data, and therefore will not generalize well to unseen data.


Loan Prediction - Using PCA and Naive Bayes Classification with R

@machinelearnbot

Nowadays, there are numerous risks related to bank loans, both for the banks and for the borrowers getting the loans. Risk analysis of bank loans requires an understanding of the risk and the risk level. Banks need to analyze their customers for loan eligibility so that they can specifically target those customers. Banks want to automate the loan eligibility process (in real time) based on customer details such as Gender, Marital Status, Age, Occupation, Income, debts, and others provided in their online application form. As the number of transactions in the banking sector is rapidly growing and huge data volumes are available, the customers' behavior can be easily analyzed and the risks around loans can be reduced.


Extending Defensive Distillation

arXiv.org Machine Learning

Deployed machine learning (ML) models are vulnerable to inputs maliciously perturbed to force them to mispredict [1, 2]. A class of such inputs, named adversarial examples, are systematically constructed through slight perturbations of otherwise correctly classified inputs [3, 4]. These perturbations are chosen to maximize the model's prediction error while leaving the semantics of the input unchanged. Although this often poses a non-tractable optimization problem for popular architectures like deep neural networks, heuristics allow the adversary to find effective perturbations--typically through the evaluation of gradients of the model's output with respect to its inputs [3, 5]. To defend against adversarial examples, two classes of approaches exist.
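One simple gradient-based heuristic is a single gradient-sign step: each input feature is shifted by a small amount in the direction that increases the loss. The sketch below applies it to a two-feature logistic model. It illustrates the attack idea only, not the defense studied in the paper, and the weights, input, and `epsilon` are hypothetical values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Probability of class 1 under a logistic model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def log_loss(p, y):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def sign(g):
    return (g > 0) - (g < 0)

def gradient_sign_perturbation(weights, bias, x, y, epsilon):
    """Shift each input feature by epsilon in the direction that raises the loss."""
    p = predict(weights, bias, x)
    # For the logistic model, d(loss)/d(x_i) = (p - y) * w_i.
    grad = [(p - y) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]

# Hypothetical model and correctly classified input, for illustration only.
weights, bias = [2.0, -1.0], 0.0
x, y = [0.3, 0.1], 1

x_adv = gradient_sign_perturbation(weights, bias, x, y, epsilon=0.2)
print(predict(weights, bias, x), predict(weights, bias, x_adv))
```

In this toy setting the predicted probability for the true class drops from above 0.5 to below 0.5, so a perturbation of at most 0.2 per feature is enough to flip the predicted label.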


Boosting Factor-Specific Functional Historical Models for the Detection of Synchronisation in Bioelectrical Signals

arXiv.org Machine Learning

The link between different psychophysiological measures during emotion episodes is not well understood. To analyse the functional relationship between electroencephalography (EEG) and facial electromyography (EMG), we apply historical function-on-function regression models to EEG and EMG data that were simultaneously recorded from 24 participants while they were playing a computerised gambling task. Given the complexity of the data structure for this application, we extend simple functional historical models to models including random historical effects, factor-specific historical effects, and factor-specific random historical effects. Estimation is conducted by a component-wise gradient boosting algorithm, which scales well to large data sets and complex models.
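The core loop of component-wise gradient boosting can be sketched for the simplest case: squared-error loss with one linear base learner per scalar covariate. This is a strong simplification and not the paper's functional historical model. The function name `componentwise_l2_boost`, the step length, and the data are assumptions made for illustration.

```python
def componentwise_l2_boost(columns, y, n_iter=200, step=0.1):
    """Component-wise gradient boosting for squared-error loss.

    columns: list of covariate columns (each a list of floats, not all zero).
    Returns one coefficient per column.
    """
    coefs = [0.0] * len(columns)
    # For squared-error loss the negative gradient is the current residual.
    residual = list(y)
    for _ in range(n_iter):
        best_j, best_beta, best_gain = None, 0.0, 0.0
        for j, col in enumerate(columns):
            xx = sum(v * v for v in col)
            xr = sum(v * r for v, r in zip(col, residual))
            beta = xr / xx       # least-squares fit of this single component
            gain = xr * xr / xx  # reduction in the residual sum of squares
            if gain > best_gain:
                best_j, best_beta, best_gain = j, beta, gain
        if best_j is None:       # residual is orthogonal to every column
            break
        # Update only the best-fitting component, shrunk by the step length.
        coefs[best_j] += step * best_beta
        residual = [r - step * best_beta * v
                    for r, v in zip(residual, columns[best_j])]
    return coefs

# Hypothetical data: the response depends on the first covariate only.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, -1.0, 1.0, -1.0]
y = [2.0 * v for v in x1]

print(componentwise_l2_boost([x1, x2], y))
```

Because only one component is updated per iteration, covariates that never win the selection step keep a coefficient of zero, which gives the procedure a built-in form of variable selection.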


Comparison of Decision Tree Based Classification Strategies to Detect External Chemical Stimuli from Raw and Filtered Plant Electrical Response

arXiv.org Machine Learning

Plants monitor their surrounding environment and control their physiological functions by producing an electrical response. We recorded electrical signals from different plants by exposing them to Sodium Chloride (NaCl), Ozone (O3) and Sulfuric Acid (H2SO4) under laboratory conditions. After applying pre-processing techniques such as filtering and drift removal, we extracted a few statistical features from the acquired plant electrical signals. Using these features, combined with different classification algorithms, we used a decision tree based multi-class classification strategy to identify the three different external chemical stimuli. We here present our exploration to obtain the optimum combination of ranked features and classifier that can separate a particular chemical stimulus from the incoming stream of plant electrical signals. The paper also reports an exhaustive comparison of similar feature based classification using the filtered and the raw plant signals, the latter containing both the high frequency stochastic part and the low frequency trends, as two different cases for feature extraction. The work presented in this paper opens up new possibilities for using plant electrical signals to monitor and detect other environmental stimuli apart from NaCl, O3 and H2SO4 in the future.


Spam Users Identification in Wikipedia Via Editing Behavior

AAAI Conferences

In this paper, we address the problem of identifying spam users on Wikipedia and present our preliminary results. We formulate the problem as a binary classification task and propose a set of features based on user editing behavior to separate spammers from benign users. We tested our system on a new dataset we built consisting of 4.2K (half spam and half benign) users and 75.6K edits. Experimental results show that our approach reaches 80.8% classification accuracy and 0.88 mean average precision. We compared against ORES, the most recent tool developed by Wikimedia, which assigns a damaging score to each edit, and we show that our system outperforms ORES in spam user detection. Moreover, by combining our features with ORES, classification accuracy increases to 82.1%. We also show that our system performs well in a more realistic, unbalanced setting, that is, when spammers are greatly outnumbered by benign users, by achieving an AUROC of 0.84 (which increases to 0.86 when we combine with ORES).
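The switch from accuracy to AUROC in the unbalanced setting can be motivated with a small sketch. AUROC is computed here as the probability that a randomly chosen positive is scored above a randomly chosen negative. The sample of 2 spammers among 18 benign users and the degenerate constant-score model are hypothetical and unrelated to the paper's dataset.

```python
def auroc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical unbalanced sample: 2 spammers (1) among 18 benign users (0).
y_true = [1, 1] + [0] * 18

# A degenerate model that treats every user as benign.
constant_scores = [0.0] * 20
print(accuracy(y_true, [0] * 20))      # 0.9, despite finding no spammer
print(auroc(y_true, constant_scores))  # 0.5, no ranking ability
```

The degenerate model looks strong on accuracy only because benign users dominate the sample, while its AUROC stays at the chance level of 0.5.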


Document Classification with scikit-learn

@machinelearnbot

Document classification is a fundamental machine learning task. It is used for all kinds of applications, like filtering spam, routing support requests to the right support rep, language detection, genre classification, sentiment analysis, and many more. To demonstrate text classification with scikit-learn, we're going to build a simple spam filter. While the filters in production for services like Gmail are vastly more sophisticated, the model we'll have by the end of this tutorial is effective, and surprisingly accurate. Spam filtering is kind of like the "Hello world" of document classification. However, something to be aware of is that you aren't limited to two classes.
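To make the idea concrete, here is a minimal bag-of-words naive Bayes classifier with add-one smoothing. It is not the tutorial's code: the tutorial builds its filter with scikit-learn, while this sketch uses only the Python standard library so that it stays self-contained. The toy corpus, the third `support` class, and the function names are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train(documents):
    """documents: list of (text, label) pairs. Returns per-label word and document counts."""
    word_counts, doc_counts = {}, Counter()
    for text, label in documents:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Return the label with the highest posterior log-probability."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(counts.values())
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out a class.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy corpus with three classes, for illustration only.
corpus = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
    ("invoice for the support request", "support"),
]
model = train(corpus)
print(classify("free prize money", *model))
print(classify("monday team meeting", *model))
```

Nothing in the classifier is specific to two labels: adding a class only requires labelled examples for it, as the hypothetical `support` label shows.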


Looking For AI Exposure? Cyber Security May Have You Covered

#artificialintelligence

One of the complications, and opportunities, confronting the cyber security industry is that the cyber threat may be escalating beyond the capacity of a human-centric response. Consider for instance the remarks of the outgoing chief of the Department of Defense, that "given the volume [of attacks] and where I see the threat moving it will be impossible for humans by themselves to keep pace." The DoD currently finds itself amidst a $1.6 billion project of centralizing its hundreds of separate firewalls into a unified system, the end purpose being to enable effective integration of artificial intelligence capabilities. While this DoD example is an isolated one, it nonetheless epitomizes the human limitation in countering the cyber threat, which is primarily a digitalized, computer-driven hazard. As Benedict Cumberbatch playing Alan Turing in The Imitation Game quipped, "our problem is that we're trying to beat [enigma] with men. What if only a machine can defeat another machine?"