Accuracy
Yet Another Caret Workshop
We'll start with a place-holder regression example for completeness. You should always set the seed before calling train. Probably not the most amazing \(R^2\) value you have ever seen, but that's alright. Note that calling the model fit displays the most crucial information in a succinct way. Let's move on to a classification algorithm.
Performance Impact Caused by Hidden Bias of Training Data for Recognizing Textual Entailment
The quality of training data is one of the crucial problems when a learning-centered approach is employed. This paper proposes a new method to investigate the quality of a large corpus designed for the recognizing textual entailment (RTE) task. The proposed method, which is inspired by a statistical hypothesis test, consists of two phases: the first phase is to introduce the predictability of textual entailment labels as a null hypothesis which is extremely unacceptable if a target corpus has no hidden bias, and the second phase is to test the null hypothesis using a Naive Bayes model. The experimental result of the Stanford Natural Language Inference (SNLI) corpus does not reject the null hypothesis. Therefore, it indicates that the SNLI corpus has a hidden bias which allows prediction of textual entailment labels from hypothesis sentences even if no context information is given by a premise sentence. This paper also presents the performance impact of NN models for RTE caused by this hidden bias.
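The second phase described above, predicting the entailment label from the hypothesis sentence alone, can be illustrated with a small sketch. This is not the paper's implementation: it assumes a bag-of-words multinomial Naive Bayes with add-one smoothing, and the function names and toy sentences are hypothetical. If such a premise-blind classifier scores well above chance on held-out data, the labels are predictable without context, which is the kind of hidden bias the abstract describes.

```python
import math
from collections import Counter, defaultdict


def train_hypothesis_only_nb(examples):
    """Fit a multinomial Naive Bayes on (hypothesis_text, label) pairs.

    The premise is deliberately never seen by the model.
    """
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for token in text.lower().split():
            word_counts[label][token] += 1
            vocab.add(token)
    return label_counts, word_counts, vocab


def predict_label(model, hypothesis):
    """Return the label with the highest log posterior for a hypothesis."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)
        # Add-one (Laplace) smoothing over the training vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in hypothesis.lower().split():
            if token in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, for illustration only (not taken from SNLI).
toy_train = [
    ("a man is sleeping", "contradiction"),
    ("nobody is outside", "contradiction"),
    ("a person is outdoors", "entailment"),
    ("a person is outside", "entailment"),
    ("a man is waiting for a friend", "neutral"),
    ("a woman is waiting for her friend", "neutral"),
]
model = train_hypothesis_only_nb(toy_train)
print(predict_label(model, "nobody is sleeping"))
```

On a real corpus, the check would be to compare the held-out accuracy of this premise-blind model against the majority-class baseline.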
SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary
Fernandez, Alberto, Garcia, Salvador, Herrera, Francisco, Chawla, Nitesh V.
The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered the "de facto" standard in the framework of learning from imbalanced data. This is due to the simplicity of its design, as well as its robustness when applied to different types of problems. Since its publication in 2002, SMOTE has proven successful in a variety of applications from several different domains. SMOTE has also inspired several approaches to counter the issue of class imbalance, and has significantly contributed to new supervised learning paradigms, including multilabel classification, incremental learning, semi-supervised learning, and multi-instance learning, among others. It is a standard benchmark for learning from imbalanced data. It is also featured in a number of different software packages, from open source to commercial. In this paper, marking the fifteen-year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current state of affairs with SMOTE and its applications, and identify the next set of challenges to extend SMOTE for Big Data problems.
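The core idea behind SMOTE, creating synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours, can be sketched in a few lines. This is a simplified illustration and not a reference implementation: the function name `smote_sketch`, the random choice of base samples, and the toy points are assumptions made here.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority-class tuples.

    Simplification (an assumption of this sketch): base samples are drawn
    at random instead of iterating over every minority sample in turn.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority samples: the corners of the unit square.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(points, n_new=10, k=2, seed=1)
print(len(new_points))
```

Every synthetic point lies on a line segment between two existing minority samples, so the oversampled class stays inside the region the original minority samples span.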
Streaming Active Learning Strategies for Real-Life Credit Card Fraud Detection: Assessment and Visualization
Carcillo, Fabrizio, Borgne, Yann-Aël Le, Caelen, Olivier, Bontempi, Gianluca
Some of them are related to the data distribution, notably the class imbalance of the training set (many more genuine transactions than fraudulent ones), the non-stationarity of the phenomenon (due to changes in the behavior of customers as well as of fraudsters), the large dimensionality and the overlapping classes (while fraudsters try to emulate cardholders' behavior, genuine behaviors of cardholders might look strange or anomalous). The labeling process is constrained, as every day human investigators may contact only a small number of cardholders associated with the riskiest transactions and obtain the class (fraud or genuine) of the related transactions. The high cost of human labour for assessing the transaction labels leads to the labeling bottleneck [2]. In this context, an automatic Fraud Detection System (FDS) should support the activity of the investigators by letting them focus on the transactions with the highest fraud probability. From the perspective of the transactional service company, this is crucial in order to reduce the costs of the investigation activity and to retain customer confidence. From a machine learning perspective, it is important to keep an adequate balance between exploitation and exploration, i.e. between the short-term need of providing good alerts to investigators and the long-term goal of maintaining a high accuracy of the system (e.g. in the presence of concept drift). The issue of labeling the most informative data while minimizing the cost has been extensively addressed by active learning, which can be considered a specific instance of semi-supervised learning [8, 41], the domain studying how unlabeled and labeled data can both contribute to
Comparison of ontology alignment systems across single matching task via the McNemar's test
Mohammadi, Majid, Atashin, Amir Ahooye, Hofman, Wout, Tan, Yao-Hua
Ontology alignment is widely used to find the correspondences between different ontologies in diverse fields. After discovering the alignments, several performance scores are available to evaluate them. The scores typically require the identified alignment and a reference containing the underlying actual correspondences of the given ontologies. The current trend in alignment evaluation is to put forward a new score (e.g., precision, weighted precision, etc.) and to compare various alignments by juxtaposing the obtained scores. However, it is substantially provocative to select one measure among others for comparison. On top of that, claiming that one system performs better than another cannot be substantiated solely by comparing two scalars. In this paper, we propose statistical procedures which enable us to theoretically favor one system over another. The McNemar's test is the statistical means by which the comparison of two ontology alignment systems over one matching task is drawn. The test applies to a 2x2 contingency table which can be constructed in two different ways based on the alignments, each of which has its own merits and pitfalls. The ways of constructing the contingency table and various apposite statistics from the McNemar's test are elaborated in minute detail. In the case of having more than two alignment systems for comparison, the family-wise error rate is expected to happen. Thus, the ways of preventing such an error are also discussed. A directed graph visualizes the outcome of the McNemar's test in the presence of multiple alignment systems. From this graph, it is readily understood if one system is better than another or if their differences are imperceptible. The proposed statistical methodologies are applied to the systems that participated in the OAEI 2016 anatomy track, and are also used to compare several well-known similarity metrics for the same matching problem.
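As a rough illustration of the test itself, the sketch below computes the chi-square form of McNemar's statistic from the two discordant cells of a 2x2 table. It is not taken from the paper: the helper names are hypothetical, and the table is assumed to be built from per-item correctness flags for two systems, which may differ from either of the two constructions the authors describe.

```python
import math


def discordant_counts(correct_a, correct_b):
    """Count items that exactly one of the two systems got right."""
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    return only_a, only_b


def mcnemar_test(only_a, only_b, continuity=True):
    """Return (statistic, p_value) for McNemar's chi-square test (1 d.o.f.)."""
    n = only_a + only_b
    if n == 0:
        return 0.0, 1.0  # the systems never disagree
    diff = abs(only_a - only_b)
    if continuity:
        diff = max(diff - 1, 0)  # continuity correction
    statistic = diff ** 2 / n
    # Survival function of a chi-square variable with one degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: system A alone is right 15 times, system B alone 5 times.
stat, p = mcnemar_test(15, 5)
print(round(stat, 2), round(p, 3))
```

Only the discordant cells enter the statistic, since items that both systems get right or both get wrong carry no information about which system is better. When more than two systems are compared pairwise, the resulting p-values would still need a multiple-comparison correction, as the abstract notes.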
Visibility graphs for image processing
Iacovacci, Jacopo, Lacasa, Lucas
The family of image visibility graphs (IVGs) has recently been introduced as a set of simple algorithms by which scalar fields can be mapped into graphs. Here we explore the usefulness of such an operator in the scenario of image processing and image classification. We demonstrate that the link architecture of the image visibility graphs encapsulates relevant information on the structure of the images, and we explore their potential as image filters and compressors. We introduce several graph features, including the novel concept of Visibility Patches, and show through several examples that these features are highly informative, computationally efficient and universally applicable for general pattern recognition and image classification tasks.
Instance Selection Improves Geometric Mean Accuracy: A Study on Imbalanced Data Classification
Kuncheva, Ludmila I., Arnaiz-González, Álvar, Díez-Pastor, José-Francisco, Gunn, Iain A. D.
A natural way of handling imbalanced data is to attempt to equalise the class frequencies and train the classifier of choice on balanced data. For two-class imbalanced problems, the classification success is typically measured by the geometric mean (GM) of the true positive and true negative rates. Here we prove that GM can be improved upon by instance selection, and give the theoretical conditions for such an improvement. We demonstrate that GM is non-monotonic with respect to the number of retained instances, which discourages systematic instance selection. We also show that balancing the distribution frequencies is inferior to a direct maximisation of GM. To verify our theoretical findings, we carried out an experimental study of 12 instance selection methods for imbalanced data, using 66 standard benchmark data sets. The results reveal possible room for new instance selection methods for imbalanced data.
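The geometric mean referred to above is \(\sqrt{\mathrm{TPR} \cdot \mathrm{TNR}}\). A minimal sketch of how it can be computed, and of why it is preferred to plain accuracy on imbalanced data, follows; the function name and the toy labels are hypothetical.

```python
import math


def geometric_mean_accuracy(y_true, y_pred, positive=1):
    """Geometric mean of the true positive rate and the true negative rate."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(tpr * tnr)


# Hypothetical imbalanced problem: 2 positives, 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10
print(geometric_mean_accuracy(y_true, always_negative))
```

A classifier that always predicts the majority class reaches 80% plain accuracy on this toy data, yet its GM is zero because its true positive rate is zero.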
Detecting Regions of Maximal Divergence for Spatio-Temporal Anomaly Detection
Barz, Björn, Rodner, Erik, Garcia, Yanira Guanche, Denzler, Joachim
Automatic detection of anomalies in space- and time-varying measurements is an important tool in several fields, e.g., fraud detection, climate analysis, or healthcare monitoring. We present an algorithm for detecting anomalous regions in multivariate spatio-temporal time-series, which allows for spotting the interesting parts in large amounts of data, including video and text data. In opposition to existing techniques for detecting isolated anomalous data points, we propose the "Maximally Divergent Intervals" (MDI) framework for unsupervised detection of coherent spatial regions and time intervals characterized by a high Kullback-Leibler divergence compared with all other data given. In this regard, we define an unbiased Kullback-Leibler divergence that allows for ranking regions of different size and show how to enable the algorithm to run on large-scale data sets in reasonable time using an interval proposal technique. Experiments on both synthetic and real data from various domains, such as climate analysis, video surveillance, and text forensics, demonstrate that our method is widely applicable and a valuable tool for finding interesting events in different types of data.
A comparative study of feature selection methods for stress hotspot classification in materials
Mangal, Ankita, Holm, Elizabeth A.
The first step in constructing a machine learning model is defining the features of the data set that can be used for optimal learning. In this work we discuss feature selection methods, which can be used to build better models as well as to achieve model interpretability. We applied these methods in the context of the stress hotspot classification problem, to determine what microstructural characteristics can cause stress to build up in certain grains during uniaxial tensile deformation. The results show how some feature selection techniques are biased and demonstrate a preferred technique to get feature rankings for physical interpretations.
Stylistic Variation in Social Media Part-of-Speech Tagging
Balusu, Murali Raghu Babu, Merghani, Taha, Eisenstein, Jacob
However, this variation is often aligned with author attributes such as age, gender, and geography, as well as more readily-available social network metadata. In this paper, we report new evidence on the link between language and social networks in the task of part-of-speech tagging. We find that tagger error rates are correlated with network structure, with high accuracy in some parts of the network, and lower accuracy elsewhere. As a result, tagger accuracy depends on training from a balanced sample of the network, rather than training on texts from a narrow subcommunity. We also describe our attempts to add robustness to stylistic variation, by building a mixture-of-experts model in which each expert is associated with a region of the social network. While prior work found that similar approaches yield performance improvements in sentiment analysis and entity linking, we were unable to obtain performance improvements in part-of-speech tagging, despite strong evidence for the link between part-of-speech error rates and social network structure.