Goto

Collaborating Authors

 Performance Analysis


An in-depth guide to supervised machine learning classification

#artificialintelligence

In supervised learning, algorithms learn from labeled data. After understanding the data, the algorithm determines which label should be given to new data by associating patterns to the unlabeled new data. Supervised learning can be divided into two categories: classification and regression. Some examples of classification include spam detection, churn prediction, sentiment analysis, dog breed detection and so on. Some examples of regression include house price prediction, stock price prediction, height-weight prediction and so on.


Evaluating Usage of Images for App Classification

arXiv.org Machine Learning

App classification is useful in a number of applications such as adding apps to an app store or building a user model based on the installed apps. Presently there are a number of existing methods to classify apps based on a given taxonomy on the basis of their text metadata. However, text based methods for app classification may not work in all cases, such as when the text descriptions are in a different language, or missing, or inadequate to classify the app. One solution in such cases is to utilize the app images to supplement the text description. In this paper, we evaluate a number of approaches in which app images can be used to classify the apps. In one approach, we use Optical character recognition (OCR) to extract text from images, which is then used to supplement the text description of the app. In another, we use pic2vec to convert the app images into vectors, then train an SVM to classify the vectors to the correct app label. In another, we use the captionbot.ai tool to generate natural language descriptions from the app images. Finally, we use a method to detect and label objects in the app images and use a voting technique to determine the category of the app based on all the images. We compare the performance of our image-based techniques to classify a number of apps in our dataset. We use a text based SVM app classifier as our base and obtained an improved classification accuracy of 96% for some classes when app images are added.


Predicting the Outcome of Judicial Decisions made by the European Court of Human Rights

arXiv.org Machine Learning

In this study, machine learning models were constructed to predict whether judgments made by the European Court of Human Rights (ECHR) would lead to a violation of an Article in the Convention on Human Rights. The problem is framed as a binary classification task where a judgment can lead to a "violation" or "non-violation" of a particular Article. Using auto-sklearn, an automated algorithm selection package, models were constructed for 12 Articles in the Convention. To train these models, textual features were obtained from the ECHR Judgment documents using N-grams, word embeddings and paragraph embeddings. Additional documents, from the ECHR, were incorporated into the models through the creation of a word embedding (echr2vec) and a doc2vec model. The features obtained using the echr2vec embedding provided the highest cross-validation accuracy for 5 of the Articles. The overall test accuracy, across the 12 Articles, was 68.83%. As far as we could tell, this is the first estimate of the accuracy of such machine learning models using a realistic test set. This provides an important benchmark for future work. As a baseline, a simple heuristic of always predicting the most common outcome in the past was used. The heuristic achieved an overall test accuracy of 86.68% which is 29.7% higher than the models. Again, this was seemingly the first study that included such a heuristic with which to compare model results. The higher accuracy achieved by the heuristic highlights the importance of including such a baseline.


Multi-stream Data Analytics for Enhanced Performance Prediction in Fantasy Football

arXiv.org Machine Learning

Fantasy Premier League (FPL) performance predictors tend to base their algorithms purely on historical statistical data. The main problems with this approach is that external factors such as injuries, managerial decisions and other tournament match statistics can never be factored into the final predictions. In this paper, we present a new method for predicting future player performances by automatically incorporating human feedback into our model. Through statistical data analysis such as previous performances, upcoming fixture difficulty ratings, betting market analysis, opinions of the general-public and experts alike via social media and web articles, we can improve our understanding of who is likely to perform well in upcoming matches. When tested on the English Premier League 2018/19 season, the model outperformed regular statistical predictors by over 300 points, an average of 11 points per week, ranking within the top 0.5% of players rank 30,000 out of over 6.5 million players.


Fairness Assessment for Artificial Intelligence in Financial Industry

arXiv.org Machine Learning

Artificial Intelligence (AI) is an important driving force for the development and transformation of the financial industry. However, with the fast-evolving AI technology and application, unintentional bias, insufficient model validation, immature contingency plan and other underestimated threats may expose the company to operational and reputational risks. In this paper, we focus on fairness evaluation, one of the key components of AI Governance, through a quantitative lens. Statistical methods are reviewed for imbalanced data treatment and bias mitigation. These methods and fairness evaluation metrics are then applied to a credit card default payment example.


Na\"iveRole: Author-Contribution Extraction and Parsing from Biomedical Manuscripts

arXiv.org Machine Learning

Information about the contributions of individual authors to scientific publications is important for assessing authors' achievements. Some biomedical publications have a short section that describes authors' roles and contributions. It is usually written in natural language and hence author contributions cannot be trivially extracted in a machine-readable format. In this paper, we present 1) A statistical analysis of roles in author contributions sections, and 2) Na\"iveRole, a novel approach to extract structured authors' roles from author contribution sections. For the first part, we used co-clustering techniques, as well as Open Information Extraction, to semi-automatically discover the popular roles within a corpus of 2,000 contributions sections from PubMed Central. The discovered roles were used to automatically build a training set for Na\"iveRole, our role extractor approach, based on Na\"ive Bayes. Na\"iveRole extracts roles with a micro-averaged precision of 0.68, recall of 0.48 and F1 of 0.57. It is, to the best of our knowledge, the first attempt to automatically extract author roles from research papers. This paper is an extended version of a previous poster published at JCDL 2018.


Diagnosis of liver disease using computer-assisted imaging techniques: A Review

arXiv.org Machine Learning

The evidence says that liver disease detection using CAD is one of the most efficient techniques but the presence of better organization of studies and the performance parameters to represent the result analysis of the proposed techniques are pointedly missing in most of the recent studies. Few benchmarked studies have been found in some of the papers as benchmarking makes a reader understand that under which circumstances their experimental results or outcomes are better and useful for the future implementation and adoption of the work. Liver diseases and image processing algorithms, especially in medicine, are the most important and important topics of the day. Unfortunately, the necessary data and data, as they are invoked in the articles, are low in this area and require the revision and implementation of policies in order to gather and do more research in this field. Detection with ultrasound is quite normal in liver diseases and depends on the physician's experience and skills. CAD systems are very important for doctors to understand medical images and improve the accuracy of diagnosing various diseases. In the following, we describe the techniques used in the various stages of a CAD system, namely: extracting features, selecting features, and classifying them. Although there are many techniques that are used to classify medical images, it is still a challenging issue for creating a universally accepted approach.


A novel spike-and-wave automatic detection in EEG signals

arXiv.org Machine Learning

Spike-and-wave discharge (SWD) pattern classification in electroencephalography (EEG) signals is a key problem in signal processing. It is particularly important to develop a SWD automatic detection method in long-term EEG recordings since the task of marking the patters manually is time consuming, difficult and error-prone. This paper presents a new detection method with a low computational complexity that can be easily trained if standard medical protocols are respected. The detection procedure is as follows: First, each EEG signal is divided into several time segments and for each time segment, the Morlet 1-D decomposition is applied. Then three parameters are extracted from the wavelet coefficients of each segment: scale (using a generalized Gaussian statistical model), variance and median. This is followed by a k-nearest neighbors (k-NN) classifier to detect the spike-and-wave pattern in each EEG channel from these three parameters. A total of 106 spike-and-wave and 106 non-spike-and-wave were used for training, while 69 new annotated EEG segments from six subjects were used for classification. In these circumstances, the proposed methodology achieved 100% accuracy. These results generate new research opportunities for the underlying causes of the so-called absence epilepsy in long-term EEG recordings.


Breast Cancer Diagnosis by Higher-Order Probabilistic Perceptrons

arXiv.org Machine Learning

A two-layer neural network model that systematically includes correlations among input variables to arbitrary order and is designed to implement Bayes inference has been adapted to classify breast cancer tumors as malignant or benign, assigning a probability for either outcome. The inputs to the network represent measured characteristics of cell nuclei imaged in Fine Needle Aspiration biopsies. The present machine-learning approach to diagnosis (known as HOPP, for higher-order probabilistic perceptron) is tested on the much-studied, open-access Breast Cancer Wisconsin (Diagnosis) Data Set of Wolberg et al. This set lists, for each tumor, measured physical parameters of the cell nuclei of each sample. The HOPP model can identify the key factors -- input features and their combinations -- most relevant for reliable diagnosis. HOPP networks were trained on 90\% of the examples in the Wisconsin database, and tested on the remaining 10\%. Referred to ensembles of 300 networks, selected randomly for cross-validation, accuracy of classification for the test sets of up to 97\% was readily achieved, with standard deviation around 2\%, together with average Matthews correlation coefficients reaching 0.94 indicating excellent predictive performance. Demonstrably, the HOPP is capable of matching the predictive power attained by other advanced machine-learning algorithms applied to this much-studied database, over several decades. Analysis shows that in this special problem, which is almost linearly separable, the effects of irreducible correlations among the measured features of the Wisconsin database are of relatively minor importance, as the Naive Bayes approximation can itself yield predictive accuracy approaching 95\%. The advantages of the HOPP algorithm will be more clearly revealed in application to more challenging machine-learning problems.


MM Algorithms for Distance Covariance based Sufficient Dimension Reduction and Sufficient Variable Selection

arXiv.org Machine Learning

Sufficient dimension reduction (SDR) using distance covariance (DCOV) was recently proposed as an approach to dimension-reduction problems. Compared with other SDR methods, it is model-free without estimating link function and does not require any particular distributions on predictors (see Sheng and Yin, 2013, 2016). However, the DCOV-based SDR method involves optimizing a nonsmooth and nonconvex objective function over the Stiefel manifold. To tackle the numerical challenge, we novelly formulate the original objective function equivalently into a DC (Difference of Convex functions) program and construct an iterative algorithm based on the majorization-minimization (MM) principle. At each step of the MM algorithm, we inexactly solve the quadratic subproblem on the Stiefel manifold by taking one iteration of Riemannian Newton's method. The algorithm can also be readily extended to sufficient variable selection (SVS) using distance covariance. We establish the convergence property of the proposed algorithm under some regularity conditions. Simulation studies show our algorithm drastically improves the computation efficiency and is robust across various settings compared with the existing method. Supplemental materials for this article are available.