Accuracy
Missing Data Imputation for Supervised Learning
Missing data imputation can help improve the performance of prediction models in situations where missing data hide useful information. This paper compares methods for imputing missing categorical data for supervised classification tasks. We experiment on two machine learning benchmark datasets with missing categorical data, comparing classifiers trained on non-imputed (i.e., one-hot encoded) or imputed data with different levels of additional missing-data perturbation. We show imputation methods can increase predictive accuracy in the presence of missing-data perturbation, which can actually improve prediction accuracy by regularizing the classifier. We achieve the state-of-the-art on the Adult dataset with missing-data perturbation and k-nearest-neighbors (k-NN) imputation.
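As a rough illustration of k-NN imputation for categorical features, the sketch below fills each missing entry by majority vote among the k most similar rows. The `None` encoding for missing values, the Hamming-style distance, and the toy rows are assumptions made for this example; the abstract does not specify the distance or the implementation the authors used.

```python
from collections import Counter

MISSING = None  # assumption: missing categorical values are encoded as None


def hamming(a, b):
    """Fraction of mismatches over the features observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not MISSING and y is not MISSING]
    if not shared:
        return 1.0  # nothing to compare, treat the rows as maximally distant
    return sum(x != y for x, y in shared) / len(shared)


def knn_impute(rows, k=3):
    """Fill each missing entry with the most common value among the
    k nearest rows that actually observe that feature."""
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not MISSING:
                continue
            donors = sorted(
                (hamming(row, other), idx)
                for idx, other in enumerate(rows)
                if idx != i and other[j] is not MISSING
            )
            votes = Counter(rows[idx][j] for _, idx in donors[:k])
            if votes:
                imputed[i][j] = votes.most_common(1)[0][0]
    return imputed


# Hypothetical toy rows: (workclass, education, occupation)
rows = [
    ("private", "bachelors", "sales"),
    ("private", "bachelors", None),
    ("private", "bachelors", "sales"),
    ("gov", "masters", "admin"),
    ("gov", "masters", "admin"),
]
filled = knn_impute(rows, k=2)
```

Here the second row borrows its occupation from the two rows that match it on the observed features.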
Machine Learning of Toxicological Big Data Enables Read-Across Structure Activity Relationships (RASAR) Outperforming Animal Test Reproducibility (Toxicological Sciences, Oxford Academic)
Earlier we created a chemical hazard database via natural language processing of dossiers submitted to the European Chemical Agency with approximately 10 000 chemicals. We identified repeat OECD guideline tests to establish reproducibility of acute oral and dermal toxicity, eye and skin irritation, mutagenicity and skin sensitization. Based on 350–700 chemicals each, the probability that an OECD guideline animal test would output the same result in a repeat test was 78%–96% (sensitivity 50%–87%). An expanded database with more than 866 000 chemical properties/hazards was used as training data and to model health hazards and chemical properties. The constructed models automate and extend the read-across method of chemical classification. The novel models called RASARs (read-across structure activity relationship) use binary fingerprints and Jaccard distance to define chemical similarity. A large chemical similarity adjacency matrix is constructed from this similarity metric and is used ...
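The similarity measure named above can be sketched as follows: Jaccard distance between two binary fingerprints is one minus the ratio of shared "on" bits to the bits set in either fingerprint. The four-bit fingerprints and the dense list-of-lists matrix are illustrative assumptions, not the representation used in the study.

```python
def jaccard_distance(fp_a, fp_b):
    """Jaccard distance between two equal-length binary fingerprints."""
    a_on = {i for i, bit in enumerate(fp_a) if bit}
    b_on = {i for i, bit in enumerate(fp_b) if bit}
    union = a_on | b_on
    if not union:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a_on & b_on) / len(union)


def similarity_matrix(fingerprints):
    """Dense pairwise similarity (1 - Jaccard distance) between all chemicals."""
    return [
        [1.0 - jaccard_distance(a, b) for b in fingerprints]
        for a in fingerprints
    ]


# Hypothetical four-bit fingerprints for three chemicals
fingerprints = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]
sim = similarity_matrix(fingerprints)
```

The first two fingerprints share two of three set bits, so their distance is 1/3; the third shares none with the first, so its distance is 1.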
Multi-Objective Cognitive Model: a supervised approach for multi-subject fMRI analysis
Yousefnezhad, Muhammad, Zhang, Daoqiang
To decode the human brain, Multivariate Pattern (MVP) classification generates cognitive models from functional Magnetic Resonance Imaging (fMRI) datasets. In the standard MVP analysis pipeline, brain patterns in a multi-subject fMRI dataset must be mapped to a shared space, and a classification model is then generated from the mapped patterns. However, MVP models may not provide stable performance on a new fMRI dataset because the standard pipeline uses disjoint steps to generate them. Each step in the pipeline has its own objective function and an independent optimization approach, so the best solution of one step may not be optimal for the next steps. To tackle this issue, this paper introduces the Multi-Objective Cognitive Model (MOCM), which uses an integrated objective function for MVP analysis rather than those disjoint steps. To solve the integrated problem, we propose a customized multi-objective optimization approach in which all possible solutions are first generated, and our method then ranks and selects the robust solutions as the final results. Empirical studies confirm that the proposed method can generate superior performance in comparison with other techniques. Keywords: Multi-Objective Cognitive Model · fMRI Analysis · Multivariate Pattern · Multi-Objective Optimization. The authors are with the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China. 1 Introduction: One of the primary goals in neuroscience is to understand how the neural activities in the human brain can be mapped to different cognitive tasks. Magnetic Resonance Imaging (fMRI) data is an interdisciplinary technique.
Active Learning for Wireless IoT Intrusion Detection
Yang, Kai, Ren, Jie, Zhu, Yanqiao, Zhang, Weiyi
The Internet of Things (IoT) is becoming truly ubiquitous in our everyday life, but it also faces unique security challenges. Intrusion detection is critical for the security and safety of a wireless IoT network. This paper discusses the human-in-the-loop active learning approach for wireless intrusion detection. We first present the fundamental challenges in designing a successful Intrusion Detection System (IDS) for a wireless IoT network. We then briefly review the rudimentary concepts of active learning and propose its employment in the diverse applications of wireless intrusion detection. An experimental example is also presented to show the significant performance improvement of the active learning method over the traditional supervised learning approach. While machine learning techniques have been widely employed for intrusion detection, the application of human-in-the-loop machine learning, which leverages both machine and human intelligence, to intrusion detection in IoT is still in its infancy. We hope this article can assist readers in understanding the key concepts of active learning and spur further research in this area.
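One common way to realize the human-in-the-loop step is uncertainty sampling: the model asks an analyst to label the traffic samples it is least sure about. The abstract does not say which query strategy the authors use, so the strategy, the function name `select_queries`, and the example scores below are all assumptions for illustration.

```python
def select_queries(probabilities, budget):
    """Return indices of the `budget` samples whose predicted intrusion
    probability is closest to 0.5, i.e. where the classifier is least certain."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:budget]


# Hypothetical classifier outputs for five unlabeled traffic samples
scores = [0.02, 0.48, 0.91, 0.55, 0.10]
to_label = select_queries(scores, budget=2)
```

The selected samples would be sent to a human expert, and their labels added to the training set before the classifier is retrained.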
High-dimensional regression in practice: an empirical study of finite-sample prediction, variable selection and ranking
Wang, Fan, Mukherjee, Sach, Richardson, Sylvia, Hill, Steven M.
Penalized likelihood methods are widely used for high-dimensional regression. Although many methods have been proposed and the associated theory is now well-developed, the relative efficacy of different methods in finite-sample settings, as encountered in practice, remains incompletely understood. There is therefore a need for empirical investigations in this area that can offer practical insight and guidance to users of these methods. In this paper we present a large-scale comparison of penalized regression methods. We distinguish between three related goals: prediction, variable selection and variable ranking. Our results span more than 1,800 data-generating scenarios, allowing us to systematically consider the influence of various factors (sample size, dimensionality, sparsity, signal strength and multicollinearity). We consider several widely-used methods (Lasso, Elastic Net, Ridge Regression, SCAD, the Dantzig Selector as well as Stability Selection). We find considerable variation in performance between methods, with results dependent on details of the data-generating scenario and the specific goal. Our results support a 'no panacea' view, with no unambiguous winner across all scenarios, even in this restricted setting where all data align well with the assumptions underlying the methods. Lasso is well-behaved, performing competitively in many scenarios, while SCAD is highly variable. Substantial benefits from a Ridge penalty are only seen in the most challenging scenarios with strong multicollinearity. The results are supported by semi-synthetic analyses using gene expression data from cancer samples. Our empirical results complement existing theory and provide a resource to compare methods across a range of scenarios and metrics.
The impact of imbalanced training data on machine learning for author name disambiguation
In supervised machine learning for author name disambiguation, negative training data often greatly outnumber positive training data. This paper examines how the ratio of negative to positive training data can affect the performance of machine learning algorithms that disambiguate author names in bibliographic records. On multiple labeled datasets, three classifiers (Logistic Regression, Naïve Bayes, and Random Forest) are trained on representative features, such as coauthor names and title words, extracted from the same training data but with various positive-negative training data ratios. Results show that increasing negative training data can improve disambiguation performance, but only by a few percent, and can sometimes degrade it. Logistic Regression and Naïve Bayes learn optimal disambiguation models even with a base ratio (1:1) of positive and negative training data. Also, the performance improvement by Random Forest tends to saturate quickly, roughly after 1:10 ~ 1:15. These findings imply that, contrary to the common practice of using all training data, name disambiguation algorithms can be trained on part of the negative training data without much loss of disambiguation performance while increasing computational efficiency. This study calls for more attention from author name disambiguation scholars to methods for machine learning from imbalanced data.
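The ratio experiments described above amount to keeping every positive pair and drawing a fixed number of negatives per positive. A minimal sketch, assuming training examples are `(features, label)` tuples with label 1 for a matching author pair and 0 otherwise; the function name, data layout, and seed are hypothetical.

```python
import random


def subsample_negatives(pairs, ratio, seed=0):
    """Keep all positive pairs and at most `ratio` negatives per positive."""
    positives = [p for p in pairs if p[1] == 1]
    negatives = [p for p in pairs if p[1] == 0]
    n_keep = min(len(negatives), ratio * len(positives))
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return positives + rng.sample(negatives, n_keep)


# Hypothetical training set: 5 positive and 100 negative record pairs
pairs = [(f"pos{i}", 1) for i in range(5)] + [(f"neg{i}", 0) for i in range(100)]
train_1_to_10 = subsample_negatives(pairs, ratio=10)
```

Training the same classifier on subsets built with different `ratio` values gives the kind of comparison the abstract reports.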
Supervised classification for object identification in urban areas using satellite imagery
Ali, Hazrat, Awan, Adnan Ali, Khan, Sanaullah, Shafique, Omer, Rahman, Atiq ur, Khan, Shahid
This paper presents a useful method to achieve classification in satellite imagery. The approach is based on a pixel-level study employing various features such as correlation, homogeneity, energy and contrast. In this study, gray-scale images are used for training the classification model. For supervised classification, two classification techniques are employed, namely the Support Vector Machine (SVM) and Naive Bayes. With textural features used for gray-scale images, Naive Bayes performs better, with an overall accuracy of 76% compared to 68% achieved by SVM. The computational time is evaluated while performing the experiment with two different window sizes, i.e., 50x50 and 70x70. The required computational time on a single image is found to be 27 seconds for a window size of 70x70 and 45 seconds for a window size of 50x50.
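Correlation, homogeneity, energy and contrast are commonly computed from a gray-level co-occurrence matrix (GLCM). The abstract does not name the GLCM explicitly, so the sketch below is an assumption about how such features could be derived for one window; the horizontal pixel offset and the use of the sum of squared entries as "energy" (some libraries take its square root) are also choices made for this example.

```python
import math


def glcm(window, levels, dr=0, dc=1):
    """Normalized co-occurrence matrix for pixel pairs at offset (dr, dc).
    Assumes the window contains at least one such pair."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(window), len(window[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[window[r][c]][window[r2][c2]] += 1
                total += 1
    return [[n / total for n in row] for row in counts]


def texture_features(p):
    """Contrast, homogeneity, energy and correlation of a normalized GLCM."""
    levels = range(len(p))
    cells = [(i, j, p[i][j]) for i in levels for j in levels]
    contrast = sum(v * (i - j) ** 2 for i, j, v in cells)
    homogeneity = sum(v / (1 + (i - j) ** 2) for i, j, v in cells)
    energy = sum(v * v for _, _, v in cells)
    mu_i = sum(i * v for i, _, v in cells)
    mu_j = sum(j * v for _, j, v in cells)
    sd_i = math.sqrt(sum(v * (i - mu_i) ** 2 for i, _, v in cells))
    sd_j = math.sqrt(sum(v * (j - mu_j) ** 2 for _, j, v in cells))
    cov = sum(v * (i - mu_i) * (j - mu_j) for i, j, v in cells)
    # Correlation is undefined for a constant row or column; report 0.0 then.
    correlation = cov / (sd_i * sd_j) if sd_i > 0 and sd_j > 0 else 0.0
    return {
        "contrast": contrast,
        "homogeneity": homogeneity,
        "energy": energy,
        "correlation": correlation,
    }


# Hypothetical 2x2 window quantized to two gray levels
features = texture_features(glcm([[0, 0], [1, 1]], levels=2))
```

In a pipeline like the one described, a feature vector of this kind would be computed per window and passed to the SVM or Naive Bayes classifier.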
Make "Fairness by Design" Part of Machine Learning
Machine learning is increasingly being used to predict individuals' attitudes, behaviors, and preferences across an array of applications -- from personalized marketing to precision medicine. Unsurprisingly, given the speed of change and ever-increasing complexity, there have been several recent high-profile examples of "machine learning gone wrong." A chatbot trained using Twitter was shut down after only a single day because of its obscene and inflammatory tweets. Machine learning models used in a popular search engine struggle to differentiate human images from those of gorillas, and show female searchers ads for lower paying jobs relative to male users. More recently, a study compared the commonly used crime risk analysis tool COMPAS against recidivism predictions from 400 untrained workers recruited via Amazon Mechanical Turk.
Anomaly Detection via Minimum Likelihood Generative Adversarial Networks
Wang, Chu, Zhang, Yan-Ming, Liu, Cheng-Lin
Anomaly detection aims to detect abnormal events using a model of normality. It plays an important role in many domains, such as network intrusion detection and criminal activity identification. With the rapidly growing size of accessible training data and high computation capacities, deep-learning-based anomaly detection has become increasingly popular. In this paper, a new domain-based anomaly detection method based on generative adversarial networks (GAN) is proposed. Minimum likelihood regularization is proposed to make the generator produce more anomalies and prevent it from converging to the normal data distribution. A proper ensemble of anomaly scores is shown to effectively improve the stability of the discriminator. The proposed method achieves significant improvement over other anomaly detection methods on the CIFAR-10 and UCI datasets.
Open Category Detection with PAC Guarantees
Liu, Si, Garrepalli, Risheek, Dietterich, Thomas G., Fern, Alan, Hendrycks, Dan
Open category detection is the problem of detecting "alien" test instances that belong to categories or classes that were not present in the training data. In many applications, reliably detecting such aliens is central to ensuring the safety and accuracy of test set predictions. Unfortunately, there are no algorithms that provide theoretical guarantees on their ability to detect aliens under general assumptions. Further, while there are algorithms for open category detection, there are few empirical results that directly report alien detection rates. Thus, there are significant theoretical and empirical gaps in our understanding of open category detection. In this paper, we take a step toward addressing this gap by studying a simple, but practically-relevant variant of open category detection. In our setting, we are provided with a "clean" training set that contains only the target categories of interest and an unlabeled "contaminated" training set that contains a fraction $\alpha$ of alien examples. Under the assumption that we know an upper bound on $\alpha$, we develop an algorithm with PAC-style guarantees on the alien detection rate, while aiming to minimize false alarms. Empirical results on synthetic and standard benchmark datasets demonstrate the regimes in which the algorithm can be effective and provide a baseline for further advancements.