Goto

Collaborating Authors

 Accuracy


LoRAS: An oversampling approach for imbalanced datasets

arXiv.org Machine Learning

The Synthetic Minority Oversampling TEchnique (SMOTE) is widely-used for the analysis of imbalanced datasets. It is known that SMOTE frequently over-generalizes the minority class, leading to misclassifications for the majority class, and effecting the overall balance of the model. In this article, we present an approach that overcomes this limitation of SMOTE, employing Localized Random Affine Shadowsampling (LoRAS) to oversample from an approximated data manifold of the minority class. We benchmarked our LoRAS algorithm with 28 publicly available datasets and show that that drawing samples from an approximated data manifold of the minority class is the key to successful oversampling. We compared the performance of LoRAS, SMOTE, and several SMOTE extensions and observed that for imbalanced datasets LoRAS, on average generates better Machine Learning (ML) models in terms of F1-score and Balanced Accuracy. Moreover, to explain the success of the algorithm, we have constructed a mathematical framework to prove that LoRAS is a more effective oversampling technique since it provides a better estimate to mean of the underlying local data distribution of the minority class data space.


Automated Generation of Test Models from Semi-Structured Requirements

arXiv.org Machine Learning

[Context:] Model-based testing is an instrument for automated generation of test cases. It requires identifying requirements in documents, understanding them syntactically and semantically, and then translating them into a test model. One light-weight language for these test models are Cause-Effect-Graphs (CEG) that can be used to derive test cases. [Problem:] The creation of test models is laborious and we lack an automated solution that covers the entire process from requirement detection to test model creation. In addition, the majority of requirements is expressed in natural language (NL), which is hard to translate to test models automatically. [Principal Idea:] We build on the fact that not all NL requirements are equally unstructured. We found that 14 % of the lines in requirements documents of our industry partner contain "pseudo-code"-like descriptions of business rules. We apply Machine Learning to identify such semi-structured requirements descriptions and propose a rule-based approach for their translation into CEGs. [Contribution:] We make three contributions: (1) an algorithm for the automatic detection of semi-structured requirements descriptions in documents, (2) an algorithm for the automatic translation of the identified requirements into a CEG and (3) a study demonstrating that our proposed solution leads to 86 % time savings for test model creation without loss of quality.


Viability of machine learning to reduce workload in systematic review screenings in the health sciences: a working paper

arXiv.org Machine Learning

Systematic reviews, which summarize and synthesize all the current research in a specific topic, are a crucial component to academia. They are especially important in the biomedical and health sciences, where they synthesize the state of medical evidence and conclude the best course of action for various diseases, pathologies, and treatments. Due to the immense amount of literature that exists, as well as the output rate of research, reviewing abstracts can be a laborious process. Automation may be able to significantly reduce this workload. Of course, such classifications are not easily automated due to the peculiar nature of written language. Machine learning may be able to help. This paper explored the viability and effectiveness of using machine learning modelling to classify abstracts according to specific exclusion/inclusion criteria, as would be done in the first stage of a systematic review. The specific task was performing the classification of deciding whether an abstract is a randomized control trial (RCT) or not, a very common classification made in systematic reviews in the healthcare field. Random training/testing splits of an n=2042 dataset of labelled abstracts were repeatedly created (1000 times in total), with a model trained and tested on each of these instances. A Bayes classifier as well as an SVM classifier were used, and compared to non-machine learning, simplistic approaches to textual classification. An SVM classifier was seen to be highly effective, yielding a 90% accuracy, as well as an F1 score of 0.84, and yielded a potential workload reduction of 70%. This shows that machine learning has the potential to significantly revolutionize the abstract screening process in healthcare systematic reviews.


Song Hit Prediction: Predicting Billboard Hits Using Spotify Data

arXiv.org Machine Learning

In this work, we attempt to solve the Hit Song Science problem, which aims to predict which songs will become chart-topping hits. We constructed a dataset with approximately 1.8 million hit and non-hit songs and extracted their audio features using the Spotify Web API. We test four models on our dataset. Our best model was random forest, which was able to predict Billboard song success with 88% accuracy.


Denoising and Verification Cross-Layer Ensemble Against Black-box Adversarial Attacks

arXiv.org Machine Learning

--Deep neural networks (DNNs) have demonstrated impressive performance on many challenging machine learning tasks. However, DNNs are vulnerable to adversarial inputs generated by adding maliciously crafted perturbations to the benign inputs. As a growing number of attacks have been reported to generate adversarial inputs of varying sophistication, the defense-attack arms race has been accelerated. MODEF intelligently combines unsupervised model denoising ensemble with supervised model verification ensemble by quantifying model diversity, aiming to boost the robustness of the target model against adversarial examples. Evaluated using eleven representative attacks on popular benchmark datasets, we show that MODEF achieves remarkable defense success rates, compared with existing defense methods, and provides a superior capability of repairing adversarial inputs and making correct predictions with high accuracy in the presence of black-box attacks. The recent advances in deep neural networks (DNNs) have powered numerous applications in different domains due to their outstanding performance compared to traditional machine learning techniques. However, it has been shown that DNNs can be easily fooled by adversarial inputs [1], making them become a double-edged sword as the vulnerability of DNNs to adversarial attacks has posed serious threats to many security-critical applications, such as biometric authentication and autonomous driving. As a number of defenses are being proposed, more attacks of varying sophistication have been put forward, accelerating the defense-attack arms race. Some even argue that designing new attacks requires much less efforts than developing effective defenses. Thus, improving the robustness and defensibility against adversarial attacks is crucial. Adversarial examples are generated by maliciously perturbing benign examples sent to the target DNN model through querying its prediction API, aiming to fool and mislead the target model to misclassify by producing incorrect predictions randomly (untargeted attack) or purposefully (targeted attack).


Learning Fair Classifiers in Online Stochastic Settings

arXiv.org Machine Learning

In many real life situations, including job and loan applications, gatekeepers must make justified, real-time decisions about a person's fitness for a particular opportunity. People on both sides of such decisions have understandable concerns about their fairness, especially when they occur online or algorithmically. In this paper we consider the setting where we try to satisfy approximate fairness in an online decision making process where examples are sampled i.i.d from an underlying distribution. The fairness metric we consider is "equalized odds", which requires that approximately equalized false positive rates and false negative rates across groups. Our work follows from the classical learning from experts scheme and extends the multiplicative weights algorithm by maintaining an estimation for label distribution and keeping separate weights for label classes as well as groups. Our theoretical results show that approximate equalized odds can be achieved without sacrificing much regret from some distributions. We also demonstrate the algorithm on real data sets commonly used by the fairness community.


The efficacy of various machine learning models for multi-class classification of RNA-seq expression data

arXiv.org Machine Learning

Late diagnosis and high costs are key factors that negatively impact the care of cancer patients worldwide. Although the availability of biological markers for the diagnosis of cancer type is increasing, costs and reliability of tests currently present a barrier to the adoption of their routine use. There is a pressing need for accurate methods that enable early diagnosis and cover a broad range of cancers. The use of machine learning and RNA-seq expression analysis has shown promise in the classification of cancer type. However, research is inconclusive about which type of machine learning models are optimal. The suitability of five algorithms were assessed for the classification of 17 different cancer types. Each algorithm was fine-tuned and trained on the full array of 18,015 genes per sample, for 4,221 samples (75 % of the dataset). They were then tested with 1,408 samples (25 % of the dataset) for which cancer types were withheld to determine the accuracy of prediction. The results show that ensemble algorithms achieve 100% accuracy in the classification of 14 out of 17 types of cancer. The clustering and classification models, while faster than the ensembles, performed poorly due to the high level of noise in the dataset. When the features were reduced to a list of 20 genes, the ensemble algorithms maintained an accuracy above 95% as opposed to the clustering and classification models.


Distinction Maximization Loss: Fast, Scalable, Turnkey, and Native Neural Networks Out-of-Distribution Detection simply by Replacing the SoftMax Loss

arXiv.org Machine Learning

Recently, many methods to reduce neural networks uncertainty have been proposed. However, most of the techniques used in these solutions usually present severe drawbacks. In this paper, we argue that neural networks low out-of-distribution detection performance is mainly due to the SoftMax loss anisotropy. Therefore, we built an isotropic loss to reduce neural networks uncertainty in a fast, scalable, turnkey, and native approach. Our experiments show that replacing SoftMax with the proposed loss does not affect classification accuracy. Moreover, our proposal overcomes ODIN typically by a large margin while producing usually competitive results against a state-of-the-art Mahalanobis method despite avoiding their limitations. Hence, neural networks uncertainty may be significantly reduced by a simple loss change without relying on special procedures such as data augmentation, adversarial training/validation, ensembles, or additional classification/regression models.


Neural Network Based Undersampling Techniques

arXiv.org Machine Learning

Class imbalance problem is commonly faced while developing machine learning models for real-life issues. Due to this problem, the fitted model tends to be biased towards the majority class data, which leads to lower precision, recall, AUC, F1, G-mean score. Several researches have been done to tackle this problem, most of which employed resampling, i.e. oversampling and undersampling techniques to bring the required balance in the data. In this paper, we propose neural network based algorithms for undersampling. Then we resampled several class imbalanced data using our algorithms and also some other popular resampling techniques. Afterwards we classified these undersampled data using some common classifier. We found out that our resampling approaches outperform most other resampling techniques in terms of both AUC, F1 and G-mean score.


eSports Pro-Players Behavior During the Game Events: Statistical Analysis of Data Obtained Using the Smart Chair

arXiv.org Artificial Intelligence

--T oday's competition between the professional eSports teams is so strong that in-depth analysis of players' performance literally crucial for creating a powerful team. There are two main approaches to such an estimation: obtaining features and metrics directly from the in-game data or collecting detailed information about the player including data on his/her physical training. While the correlation between the player's skill and in-game data has already been covered in many papers, there are very few works related to analysis of eSports athlete's skill through his/her physical behavior . We propose the smart chair platform which is to collect data on the person's behavior on the chair using an integrated accelerometer, a gyroscope and a magnetometer . We extract the important game events to define the players' physical reactions to them. The obtained data are used for training machine learning models in order to distinguish between the low-skilled and high-skilled players. We extract and figure out the key features during the game and discuss the results. I NTRODUCTION Nowadays eSports is a rapidly growing industry with more than billion players involved worldwide.