Goto

Collaborating Authors

 Support Vector Machines


A graphical heuristic for reduction and partitioning of large datasets for scalable supervised training

arXiv.org Machine Learning

Y adav and BodeMETHODOLOGY A graphical heuristic for reduction and partitioning of large datasets for scalable supervised training Sumedh Y adav 1* and Mathis Bode 2 * Correspondence: sumedhyadav.iitkgp@gmail.com 1 Gstech T echnology Pvt. Ltd., 415, 2nd Floor, 16th Cross Road, 17th Main Road, HSR Layout Sector 4, 560102, Bengaluru, India Full list of author information is available at the end of the article Abstract A scalable graphical method is presented for selecting, and partitioning datasets for the training phase of a classification task. For the heuristic, a clustering algorithm is required to get its computation cost in a reasonable proportion to the task itself. This step is proceeded by construction of an information graph of the underlying classification patterns using approximate nearest neighbor methods. The presented method constitutes of two approaches, one for reducing a given training set, and another for partitioning the selected/reduced set. The heuristic targets large datasets, since the primary goal is significant reduction in training computation run-time without compromising prediction accuracy . T est results show that both approaches significantly speedup the training task when compared against that of state-of-the-art shrinking heuristic available in LIBSVM. Furthermore, the approaches closely follow or even outperform in prediction accuracy . A network design is also presented for the partitioning based distributed training formulation. Added speedup in training run-time is observed when compared to that of serial implementation of the approaches. Keywords: training set selection; machine learning; large datasets; distributed machine learning; classification; graph coarsening objective; network architecture design Introduction Two decades earlier, some of the most seminal works in machine learning were done on training set selection [1, 2] under the banner of relevance reasoning. However, the better part of recent works have been exclusively towards feature selection [3, 4]. With increased processing power, run time of training is feasible even for datasets erstwhile considered large. Additionally, dimensionality ( d) dominates dataset size ( n) in the algorithmic complexities of learning algorithms. In the training phase, less data points mean fewer generalization guarantees, however, as we are moving in the era of big data, even the fastest classification algorithms are taking unfeasible time to train models. When data sources are abundant, it is befitting to separate data based on relevance to the learning task. This has led to a renewed interest in the once famous problem statement of relevance reasoning [5, 6]. Reasoning on relevance to get improved scalability of classification algorithms is currently explored on graphical/network data [7], and learned models [8]. One research area where training set selection has been given attention to is support vector machines (SVM).


Incremental and Decremental Fuzzy Bounded Twin Support Vector Machine

arXiv.org Machine Learning

In this paper we present an incremental variant of the Twin Support Vector Machine (TWSVM) called Fuzzy Bounded Twin Support Vector Machine (FBTWSVM) to deal with large datasets and learning from data streams. We combine the TWSVM with a fuzzy membership function, so that each input has a different contribution to each hyperplane in a binary classifier. To solve the pair of quadratic programming problems (QPPs) we use a dual coordinate descent algorithm with a shrinking strategy, and to obtain a robust classification with a fast training we propose the use of a Fourier Gaussian approximation function with our linear FBTWSVM. Inspired by the shrinking technique, the incremental algorithm re-utilizes part of the training method with some heuristics, while the decremental procedure is based on a scored window. The FBTWSVM is also extended for multi-class problems by combining binary classifiers using a Directed Acyclic Graph (DAG) approach. Moreover, we analyzed the theoretical foundations properties of the proposed approach and its extension, and the experimental results on benchmark datasets indicate that the FBTWSVM has a fast training and retraining process while maintaining a robust classification performance.


Electroencephalography based Classification of Long-term Stress using Psychological Labeling

arXiv.org Machine Learning

Stress research is a rapidly emerging area in thefield of electroencephalography (EEG) based signal processing.The use of EEG as an objective measure for cost effective andpersonalized stress management becomes important in particularsituations such as the non-availability of mental health facilities.In this study, long-term stress is classified using baseline EEGsignal recordings. The labelling for the stress and control groupsis performed using two methods (i) the perceived stress scalescore and (ii) expert evaluation. The frequency domain featuresare extracted from five-channel EEG recordings in addition tothe frontal and temporal alpha and beta asymmetries. The alphaasymmetry is computed from four channels and used as a feature.Feature selection is also performed using a t-test to identifystatistically significant features for both stress and control groups.We found that support vector machine is best suited to classifylong-term human stress when used with alpha asymmetry asa feature. It is observed that expert evaluation based labellingmethod has improved the classification accuracy up to 85.20%.Based on these results, it is concluded that alpha asymmetry maybe used as a potential bio-marker for stress classification, when labels are assigned using expert evaluation.


Image Evolution Trajectory Prediction and Classification from Baseline using Learning-based Patch Atlas Selection for Early Diagnosis

arXiv.org Machine Learning

Patients initially diagnosed with early mild cognitive impairment (eMCI) are known to be a clinically heterogeneous group with very subtle patterns of brain atrophy. To examine the boarders between normal controls (NC) and eMCI, Magnetic Resonance Imaging (MRI) was extensively used as a non-invasive imaging modality to pin-down subtle changes in brain images of MCI patients. However, eMCI research remains limited by the number of available MRI acquisition timepoints. Ideally, one would learn how to diagnose MCI patients in an early stage from MRI data acquired at a single timepoint, while leveraging 'non-existing' follow-up observations. To this aim, we propose novel supervised and unsupervised frameworks that learn how to jointly predict and label the evolution trajectory of intensity patches, each seeded at a specific brain landmark, from a baseline intensity patch. Specifically, both strategies aim to identify the best training atlas patches at baseline timepoint to predict and classify the evolution trajectory of a given testing baseline patch. The supervised technique learns how to select the best atlas patches by training bidirectional mappings from the space of pairwise patch similarities to their corresponding prediction errors -when one patch was used to predict the other. On the other hand, the unsupervised technique learns a manifold of baseline atlas and testing patches using multiple kernels to well capture patch distributions at multiple scales. Once the best baseline atlas patches are selected, we retrieve their evolution trajectories and average them to predict the evolution trajectory of the testing baseline patch. Next, we input the predicted trajectories to an ensemble of linear classifiers, each trained at a specific landmark. Our classification accuracy increased by up to 10% points in comparison to single timepoint-based classification methods.


A Study and Analysis of a Feature Subset Selection Technique using Penguin Search Optimization Algorithm (FS-PeSOA)

arXiv.org Machine Learning

In today world of enormous amounts of data, it is very important to extract useful knowledge from it. This can be accomplished by feature subset selection. Feature subset selection is a method of selecting a minimum number of features with the help of which our machine can learn and predict which class a particular data belongs to. We will introduce a new adaptive algorithm called Feature selection Penguin Search optimization algorithm which is a metaheuristic approach. It is adapted from the natural hunting strategy of penguins in which a group of penguins take jumps at random depths and come back and share the status of food availability with other penguins and in this way, the global optimum solution is found. In order to explore the feature subset candidates, the bioinspired approach Penguin Search optimization algorithm generates during the process a trial feature subset and estimates its fitness value by using three different classifiers for each case: Random Forest, Nearest Neighbour and Support Vector Machines. However, we are planning to implement our proposed approach Feature selection Penguin Search optimization algorithm on some well known benchmark datasets collected from the UCI repository and also try to evaluate and compare its classification accuracy with some state of art algorithms.


Building Machine Learning Models via Comparisons

#artificialintelligence

Nowadays most machine learning (ML) models predict labels from features. In classification tasks, an ML model predicts a categorical value and in regression tasks, an ML model predicts a real value. These ML models thus require a large amount of feature-label pairs. While in practice it is not hard to obtain features, it is often costly to obtain labels because this requires human labor. Can we learn a model without too many feature-label pairs?


Machine Learning based Prediction of Hierarchical Classification of Transposable Elements

arXiv.org Machine Learning

Transposable Elements (TEs) or jumping genes are the DNA sequences that have an intrinsic capability to move within a host genome from one genomic location to another. Studies show that the presence of a TE within or adjacent to a functional gene may alter its expression. TEs can also cause an increase in the rate of mutation and can even mediate duplications and large insertions and deletions in the genome, promoting gross genetic rearrangements. Thus, the proper classification of the identified jumping genes is essential to understand their genetic and evolutionary effects in the genome. While computational methods have been developed that perform either binary classification or multi-label classification of TEs, few studies have focused on their hierarchical classification. The state-of-the-art machine learning classification method utilizes a Multi-Layer Perceptron (MLP), a class of neural network, for hierarchical classification of TEs. However, the existing methods have limited accuracy in classifying TEs. A more effective classifier, which can explain the role of TEs in germline and somatic evolution, is needed. In this study, we examine the performance of a variety of machine learning (ML) methods. And eventually, propose a robust approach for the hierarchical classification of TEs, with higher accuracy, using Support Vector Machines (SVM).


Privacy-Preserving Classification with Secret Vector Machines

arXiv.org Machine Learning

Today, large amounts of valuable data are distributed among millions of user-held devices, such as personal computers, phones, or Internet-of-things devices. Many companies collect such data with the goal of using it for training machine learning models allowing them to improve their services. However, user-held data is often sensitive, and collecting it is problematic in terms of privacy. We address this issue by proposing a novel way of training a supervised classifier in a distributed setting akin to the recently proposed federated learning paradigm (McMahan et al. 2017), but under the stricter privacy requirement that the server that trains the model is assumed to be untrusted and potentially malicious; we thus preserve user privacy by design, rather than by trust. In particular, our framework, called secret vector machine (SecVM), provides an algorithm for training linear support vector machines (SVM) in a setting in which data-holding clients communicate with an untrusted server by exchanging messages designed to not reveal any personally identifiable information. We evaluate our model in two ways. First, in an offline evaluation, we train SecVM to predict user gender from tweets, showing that we can preserve user privacy without sacrificing classification performance. Second, we implement SecVM's distributed framework for the Cliqz web browser and deploy it for predicting user gender in a large-scale online evaluation with thousands of clients, outperforming baselines by a large margin and thus showcasing that SecVM is practicable in production environments. Overall, this work demonstrates the feasibility of machine learning on data from thousands of users without collecting any personal data. We believe this is an innovative approach that will help reconcile machine learning with data privacy.


Resource-Efficient Computing in Wearable Systems

arXiv.org Machine Learning

We propose two optimization techniques to minimize memory usage and computation while meeting system timing constraints for real-time classification in wearable systems. Our method derives a hierarchical classifier structure for Support Vector Machine (SVM) in order to reduce the amount of computations, based on the probability distribution of output classes occurrences. Also, we propose a memory optimization technique based on SVM parameters, which results in storing fewer support vectors and as a result requiring less memory. To demonstrate the efficiency of our proposed techniques, we performed an activity recognition experiment and were able to save up to 35% and 56% in memory storage when classifying 14 and 6 different activities, respectively. In addition, we demonstrated that there is a trade-off between accuracy of classification and memory savings, which can be controlled based on application requirements.


Gathering Cyber Threat Intelligence from Twitter Using Novelty Classification

arXiv.org Machine Learning

Preventing organizations from Cyber exploits needs timely intelligence about Cyber vulnerabilities and attacks, referred as threats. Cyber threat intelligence can be extracted from various sources including social media platforms where users publish the threat information in real time. Gathering Cyber threat intelligence from social media sites is a time consuming task for security analysts that can delay timely response to emerging Cyber threats. We propose a framework for automatically gathering Cyber threat intelligence from Twitter by using a novelty detection model. Our model learns the features of Cyber threat intelligence from the threat descriptions published in public repositories such as Common Vulnerabilities and Exposures (CVE) and classifies a new unseen tweet as either normal or anomalous to Cyber threat intelligence. We evaluate our framework using a purpose-built data set of tweets from 50 influential Cyber security related accounts over twelve months (in 2018). Our classifier achieves the F1-score of 0.643 for classifying Cyber threat tweets and outperforms several baselines including binary classification models. Our analysis of the classification results suggests that Cyber threat relevant tweets on Twitter do not often include the CVE identifier of the related threats. Hence, it would be valuable to collect these tweets and associate them with the related CVE identifier for cyber security applications.