Support Vector Machines
Finding Better Active Learners for Faster Literature Reviews
Yu, Zhe, Kraft, Nicholas A., Menzies, Tim
Literature reviews can be time-consuming and tedious to complete. By cataloging and refactoring three state-of-the-art active learning techniques from evidence-based medicine and legal electronic discovery, this paper finds and implements FASTREAD, a faster technique for studying a large corpus of documents. This paper assesses FASTREAD using datasets generated from existing SE literature reviews (Hall, Wahono, Radjenović, Kitchenham et al.). Compared to manual methods, FASTREAD lets researchers find 95% of the relevant studies after reviewing an order of magnitude fewer papers. Compared to other state-of-the-art automatic methods, FASTREAD reviews 20-50% fewer studies while finding the same number of relevant primary studies in a systematic literature review.
Fast Incremental SVDD Learning Algorithm with the Gaussian Kernel
Jiang, Hansi, Wang, Haoyu, Hu, Wenhao, Kakde, Deovrat, Chaudhuri, Arin
Support vector data description (SVDD) is a machine learning technique that is used for single-class classification and outlier detection. The idea of SVDD is to find a set of support vectors that defines a boundary around the data. When dealing with online or large data, existing batch SVDD methods have to be rerun in each iteration. We propose an incremental learning algorithm for SVDD that uses the Gaussian kernel. This algorithm builds on the observation that all support vectors on the boundary have the same distance to the center of the sphere in a higher-dimensional feature space as mapped by the Gaussian kernel function. Each iteration involves only the existing support vectors and the new data point. Moreover, the algorithm is based solely on matrix manipulations; the support vectors and their corresponding Lagrange multipliers $\alpha_i$ are automatically selected and determined in each iteration. The complexity of our algorithm in each iteration is only $O(k^2)$, where $k$ is the number of support vectors. Experimental results on some real data sets indicate that the proposed algorithm (FISVDD) demonstrates significant gains in efficiency with almost no loss in either outlier detection accuracy or objective function value.
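The equal-distance observation above can be checked numerically. The following is a minimal sketch, not the authors' algorithm: it assumes the common Gaussian kernel form K(x, y) = exp(-||x - y||^2 / (2 s^2)) with a bandwidth s, and a hand-picked toy description (two support vectors with multipliers 0.5 each) rather than one obtained from a solver. The function names are hypothetical.

```python
import math

def gaussian_kernel(x, y, s):
    """K(x, y) = exp(-||x - y||^2 / (2 s^2)); note that K(x, x) = 1."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * s * s))

def dist_sq_to_center(z, support_vectors, alphas, s):
    """Squared feature-space distance from phi(z) to the center
    a = sum_i alpha_i * phi(x_i), expanded with the kernel trick:
    K(z, z) - 2 * sum_i alpha_i K(x_i, z) + sum_ij alpha_i alpha_j K(x_i, x_j).
    """
    cross = sum(a * gaussian_kernel(x, z, s)
                for x, a in zip(support_vectors, alphas))
    const = sum(ai * aj * gaussian_kernel(xi, xj, s)
                for xi, ai in zip(support_vectors, alphas)
                for xj, aj in zip(support_vectors, alphas))
    return gaussian_kernel(z, z, s) - 2.0 * cross + const

# Toy values (assumed for illustration): two symmetric support vectors.
svs = [(-1.0, 0.0), (1.0, 0.0)]
alphas = [0.5, 0.5]          # multipliers sum to 1
s = 1.0                      # assumed bandwidth

d_sv1 = dist_sq_to_center(svs[0], svs, alphas, s)
d_sv2 = dist_sq_to_center(svs[1], svs, alphas, s)
d_inside = dist_sq_to_center((0.0, 0.0), svs, alphas, s)
d_far = dist_sq_to_center((5.0, 0.0), svs, alphas, s)
```

In this toy case both support vectors lie at the same squared distance from the center, the midpoint falls inside that radius, and the distant point falls outside it, which is how an outlier would be flagged.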
Engineering fast multilevel support vector machines
Sadrfaridpour, E., Razzaghi, T., Safro, I.
Support vector machine (SVM) is one of the most well-known supervised classification methods and has been extensively used in such fields as disease diagnosis, text categorization, and fraud detection. Training a nonlinear SVM classifier (such as one based on the Gaussian kernel) requires solving a convex quadratic programming (QP) model whose running time can be prohibitive for large-scale instances without specialized acceleration techniques such as sampling, boosting, and hierarchical training. Another typical reason for increased running time is complex data sets (e.g., when the data is noisy, imbalanced, or incomplete) that require model selection techniques to find the best model parameters. The motivation behind this work was extensive applied experience with hard, large-scale, industrial (not necessarily highly heterogeneous) data sets for which fast linear SVMs (as well as many other fast methods) produced extremely low-quality results, and various nonlinear SVMs exhibited a strong trade-off between running time and quality. It has been noticed in multiple works that many different real-world data sets have a strong underlying multiscale (in some works called hierarchical) structure [35, 31, 37, 66] that can be discovered through careful definitions of coarse-grained resolutions.
Machine learning for graph-based representations of three-dimensional discrete fracture networks
Valera, Manuel, Guo, Zhengyang, Kelly, Priscilla, Matz, Sean, Cantu, Vito Adrian, Percus, Allon G., Hyman, Jeffrey D., Srinivasan, Gowri, Viswanathan, Hari S.
Structural and topological information play a key role in modeling flow and transport through fractured rock in the subsurface. Discrete fracture network (DFN) computational suites such as dfnWorks are designed to simulate flow and transport in such porous media. Flow and transport calculations reveal that a small backbone of fractures exists, where most flow and transport occurs. Restricting the flowing fracture network to this backbone provides a significant reduction in the network's effective size. However, the particle tracking simulations needed to determine the reduction are computationally intensive. Such methods may be impractical for large systems or for robust uncertainty quantification of fracture networks, where thousands of forward simulations are needed to bound system behavior. In this paper, we develop an alternative network reduction approach to characterizing transport in DFNs, by combining graph theoretical and machine learning methods. We consider a graph representation where nodes signify fractures and edges denote their intersections. Using random forest and support vector machines, we rapidly identify a subnetwork that captures the flow patterns of the full DFN, based primarily on node centrality features in the graph. Our supervised learning techniques train on particle-tracking backbone paths found by dfnWorks, but run in negligible time compared to those simulations. We find that our predictions can reduce the network to approximately 20% of its original size, while still generating breakthrough curves consistent with those of the original network.
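The graph representation described above (nodes as fractures, edges as intersections) can be sketched as follows. This is an illustration only: the fracture IDs and intersection list are made up, and degree centrality is used as one possible node centrality feature; the abstract does not say which centrality measures the authors actually use.

```python
from collections import defaultdict

def build_fracture_graph(intersections):
    """Adjacency sets for an undirected graph: one node per fracture,
    one edge per pair of intersecting fractures."""
    adj = defaultdict(set)
    for u, v in intersections:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def degree_centrality(adj):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(adj)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}

# Hypothetical network of five fractures (IDs 0..4) and their intersections.
intersections = [(0, 1), (1, 2), (1, 3), (3, 4)]
graph = build_fracture_graph(intersections)
centrality = degree_centrality(graph)

# Fracture with the highest centrality: a candidate backbone member.
most_central = max(centrality, key=centrality.get)
```

A per-node feature vector built from values like these could then be passed to a classifier such as a random forest or an SVM, with labels taken from simulated backbone paths.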
Solving for multi-class using orthogonal coding matrices
Probability estimates are desirable in statistical classification both for gauging the accuracy of a classification result and for calibration. Here we describe a method of solving for the conditional probabilities in multi-class classification using orthogonal error correcting codes. The method is tested on six different datasets using support vector machines and compares favorably with an existing technique based on the one-versus-one multi-class method. Probabilities are validated based on the cumulative sum of a boolean evaluation of the correctness of the class label divided by the estimated probability. Probability estimation using orthogonal coding is simple and efficient and has the potential for faster classification results than the one-versus-one method.
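To illustrate why orthogonality makes the solve cheap, here is a minimal sketch under stated assumptions; it is not necessarily the paper's exact procedure. It assumes a coding matrix A with entries +1/-1 and mutually orthogonal columns, and that each binary classifier i returns r_i = sum_j A[i][j] * p_j, the difference between the probabilities of the two sides of its partition. Under those assumptions the class probabilities follow from p = A^T r / n without solving a linear system. The function names and the 4x4 matrix are chosen for illustration.

```python
def encode(A, p):
    """Ideal binary outputs: r_i = sum_j A[i][j] * p[j]."""
    return [sum(a * pj for a, pj in zip(row, p)) for row in A]

def decode(A, r):
    """Recover class probabilities as p = A^T r / n.
    Valid when the columns of A are orthogonal with squared norm n."""
    n = len(A)        # number of codes (binary problems)
    m = len(A[0])     # number of classes
    return [sum(A[i][j] * r[i] for i in range(n)) / n for j in range(m)]

# 4x4 Hadamard matrix: rows are codes, columns are classes.
# The all-ones first row is a trivial partition whose output is always 1,
# so it only encodes the constraint that the probabilities sum to 1.
A = [
    [1,  1,  1,  1],
    [1, -1,  1, -1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
]

p_true = [0.1, 0.2, 0.3, 0.4]   # made-up probabilities for the demo
r = encode(A, p_true)           # in practice: noisy classifier outputs
p_est = decode(A, r)
```

With real classifiers the outputs r are noisy, so the decoded values may fall slightly outside [0, 1] and would need clipping and renormalization.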
Spoken English Intelligibility Remediation with PocketSphinx Alignment and Feature Extraction Improves Substantially over the State of the Art
Gao, Yuan, Srivastava, Brij Mohan Lal, Salsman, James
We use automatic speech recognition to assess spoken English learner pronunciation based on the authentic intelligibility of the learners' spoken responses determined from support vector machine (SVM) classifier or deep learning neural network model predictions of transcription correctness. Using numeric features produced by PocketSphinx alignment mode and many recognition passes searching for the substitution and deletion of each expected phoneme and insertion of unexpected phonemes in sequence, the SVM models achieve 82% agreement with the accuracy of Amazon Mechanical Turk crowdworker transcriptions, up from 75% reported by multiple independent researchers. Using such features with SVM classifier probability prediction models can help computer-aided pronunciation teaching (CAPT) systems provide intelligibility remediation.
Index Terms: phoneme alignment, pronunciation assessment, computer aided language learning, binary features
1. INTRODUCTION
Authentic intelligibility, the ability of listeners to correctly transcribe recorded utterances, initially used for CAPT by [1] and [2], is a better measure of pronunciation assessment for spoken language learners compared to mispronunciations identified by expert pronunciation judges or panels of experts, because such mispronunciations are associated with only 16% of intelligibility problems, according to [3], who state: We investigated... which words are likely to be misrecognized and which words are likely to be marked as pronunciation errors. Words perceived as mispronounced remain intelligible in about half of all cases.
Support Vector Machine Active Learning Algorithms with Query-by-Committee versus Closest-to-Hyperplane Selection
The use of active learning has received a lot of interest for reducing annotation costs for text and speech processing applications [1], [2], [3], [4], [5], [6]. Many applications have the following three characteristics: 1) they have imbalanced data sets, 2) training data annotation is a burden, and 3) support vector machines (SVMs) are able to train high-performing systems for the application. Two examples of such applications are Text Classification (TC) and Relation Extraction (RE). Characteristics 2 and 3 suggest the use of AL-SVM (Active Learning (AL) with Support Vector Machines). Previous work has presented an AL-SVM algorithm that selects (i.e., requests labels for) the examples that are closest to the current model's hyperplane [7], [8], [9], [10]. This "closest"-based algorithm has been shown to need modification for imbalanced data situations [11]. Previous work has presented a method for adapting to imbalanced data situations in the context of AL-SVM by using asymmetric cost factors during model training [11]. The asymmetric cost model has been shown to be most effective when the model is based on prevalence statistics from an unbiased initial sample of data and serves as positive amplification for the minority positive examples.
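The closest-to-hyperplane selection step can be sketched as follows. This is a minimal illustration, assuming a linear model given by a weight vector w and bias b; the values of w, b, and the unlabeled pool are made up, and the function names are hypothetical. It does not include the asymmetric cost factors or the query-by-committee alternative.

```python
import math

def distance_to_hyperplane(x, w, b):
    """Geometric distance |w . x + b| / ||w|| from x to the hyperplane."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return abs(score) / math.sqrt(sum(wi * wi for wi in w))

def select_closest(unlabeled, w, b, batch_size=1):
    """Indices of the unlabeled examples nearest the current hyperplane,
    i.e. the ones the current model is least certain about."""
    ranked = sorted(range(len(unlabeled)),
                    key=lambda i: distance_to_hyperplane(unlabeled[i], w, b))
    return ranked[:batch_size]

# Hypothetical current model and unlabeled pool.
w, b = (1.0, -1.0), 0.0
pool = [(3.0, 0.0), (1.0, 0.9), (-2.0, 2.0), (0.5, 0.2)]

# Request labels for the two examples closest to the hyperplane.
to_label = select_closest(pool, w, b, batch_size=2)
```

In a full active learning loop, the selected examples would be labeled by an annotator, added to the training set, and the SVM retrained before the next selection round.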
Support Vector Machines for Binary Classification - MATLAB & Simulink
You can use a support vector machine (SVM) when your data has exactly two classes. An SVM classifies data by finding the best hyperplane that separates all data points of one class from those of the other class. The best hyperplane for an SVM means the one with the largest margin between the two classes. Margin means the maximal width of the slab parallel to the hyperplane that has no interior data points. The support vectors are the data points that are closest to the separating hyperplane; these points are on the boundary of the slab.
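The notions of margin and support vectors can be made concrete with a small numeric sketch. It assumes a hyperplane written as w . x + b = 0 in canonical form, so that the closest points satisfy |w . x + b| = 1 and the slab has width 2 / ||w||. The data and the values of w and b are hand-picked for illustration, not fitted by a solver, and the function names are hypothetical.

```python
import math

def decision_value(x, w, b):
    """Signed score w . x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Width of the slab between the two classes: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def find_support_vectors(points, w, b, tol=1e-9):
    """Points lying on the slab boundary, where |w . x + b| equals 1."""
    return [x for x in points if abs(abs(decision_value(x, w, b)) - 1.0) < tol]

# Toy separable data: two points per class.
positives = [(2.0, 2.0), (3.0, 3.0)]
negatives = [(0.0, 0.0), (-1.0, 0.0)]

# Hand-picked maximum-margin hyperplane for this data: 0.5*x1 + 0.5*x2 - 1 = 0.
w, b = (0.5, 0.5), -1.0

width = margin_width(w)
svs = find_support_vectors(positives + negatives, w, b)
```

Here the slab width equals the distance between the two closest points of opposite classes, (2, 2) and (0, 0), so no wider separating slab exists for this data.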
The Value of Semi-Supervised Machine Learning
Your boss hands you a pile of 100,000 unlabeled images and asks you to categorize whether they are sandals, pants, boots, etc. So now you have a massive set of unlabeled data and you need labels. Lots of companies are swimming in data, whether it's transactional, IoT sensors, security logs, images, voice, or more, and it's all unlabeled. With so little labeled data, building machine learning models is a tedious and slow process for data scientists in almost all enterprises. Take Google's street view data. Gebru had to figure out how to label cars in 50 million images with very little labeled data.
Generalizing, Decoding, and Optimizing Support Vector Machine Classification
The classification of complex data usually requires the composition of processing steps. Here, a major challenge is the selection of optimal algorithms for preprocessing and classification (including parameterizations). Nowadays, parts of the optimization process are automated, but expert knowledge and manual work are still required. We present three steps to address this process and ease the optimization. Namely, we take a theoretical view on classical classifiers, provide an approach to interpret the classifier together with the preprocessing, and integrate both into one framework which enables a semiautomatic optimization of the processing chain and which interfaces numerous algorithms.