 Performance Analysis


Evaluating Classification Models, Part 3

#artificialintelligence

This series differs from other discussions of evaluation metrics for classification models in that it aims to provide a systematic perspective. Rather than providing a laundry list of individual metrics, it situates those metrics within a fairly comprehensive family and explains how you can choose a member of that family that is appropriate for your use case. This post explains how the three weighted "Pythagorean means" (arithmetic, geometric, and harmonic) of precision and recall encode preferences over models. Suppose we build two different models, and one has better precision while the other has better recall. To choose between these models, we need to decide whether the gain from 90.8% precision to 91.5% precision that we get by going from Model A to Model B is enough to offset a loss from 99% recall to 97% recall.
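As a rough illustration of how the choice of mean and weight encodes a preference, the sketch below scores the two models from the example with all three weighted means. The weighting convention (a weight `w` on precision and `1 - w` on recall) and the function names are assumptions made for this sketch; the series itself may parameterize the weights differently.

```python
def weighted_means(precision, recall, w=0.5):
    """Return the weighted arithmetic, geometric, and harmonic means.

    w is the weight on precision and (1 - w) the weight on recall
    (an assumed convention for this sketch).
    """
    arithmetic = w * precision + (1 - w) * recall
    geometric = precision ** w * recall ** (1 - w)
    harmonic = 1.0 / (w / precision + (1 - w) / recall)
    return {"arithmetic": arithmetic, "geometric": geometric, "harmonic": harmonic}


# Precision and recall figures quoted in the text.
MODEL_A = {"precision": 0.908, "recall": 0.99}
MODEL_B = {"precision": 0.915, "recall": 0.97}


def preferred_model(w):
    """For each mean, report which model scores higher at precision weight w."""
    a = weighted_means(**MODEL_A, w=w)
    b = weighted_means(**MODEL_B, w=w)
    return {name: ("A" if a[name] > b[name] else "B") for name in a}


for w in (0.5, 0.9):
    print(w, preferred_model(w))
```

With equal weights the harmonic mean is the familiar F1 score, and all three means favor Model A. Shifting most of the weight onto precision (here `w = 0.9`) flips the preference to Model B.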


Have Unbalanced Classes? Try Significant Terms

#artificialintelligence

The words that are significant to a class can be used to improve the precision-recall trade-off in classification. And it is tougher (sorry Yogi!) when the target classes to predict have widely varying supports. But that does happen often with real-world datasets. A case in point is the prediction of a near-future CCU readmission of a patient based on a discharge note. Only a small fraction of patients get readmitted to the CCU within 30 days of a discharge. Our analysis of the MIMIC-III dataset in the previous post showed that over 93% of the patients did not require readmission.
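To see why such an imbalance makes evaluation tricky, consider a minimal sketch with a hypothetical cohort of 1,000 patients, 93% of whom are not readmitted (a rounded figure assumed for illustration; the post reports "over 93%"). A model that always predicts "no readmission" looks accurate yet never finds a single readmitted patient.

```python
def evaluate(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (1 = readmitted)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # By convention here, report 0.0 when the denominator is empty.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical cohort: 930 patients not readmitted, 70 readmitted.
y_true = [0] * 930 + [1] * 70
always_negative = [0] * len(y_true)

baseline = evaluate(y_true, always_negative)
print(baseline)
```

The baseline reaches 93% accuracy with zero recall on the readmitted class, which is why precision and recall on the minority class are the more informative quantities in this setting.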


A US government study confirms most face recognition systems are racist

#artificialintelligence

Almost 200 face recognition algorithms--a majority in the industry--had worse performance on nonwhite faces, according to a landmark study. What they tested: The US National Institute of Standards and Technology (NIST) tested every algorithm on two of the most common tasks for face recognition. The first, known as "one-to-one" matching, involves matching a photo of someone to another photo of the same person in a database. This is used to unlock smartphones or check passports, for example. The second, known as "one-to-many" searching, involves determining whether a photo of someone has any match in a database.


Machine learning and its applications in plant molecular studies

#artificialintelligence

The advent of high-throughput genomic technologies has resulted in the accumulation of massive amounts of genomic information. However, biologists are challenged with how to effectively analyze these data. Machine learning can provide tools for better and more efficient data analysis. Unfortunately, because many plant biologists are unfamiliar with machine learning, its application in plant molecular studies has been restricted to a few species and a limited set of algorithms. Thus, in this study, we provide the basic steps for developing machine learning frameworks and present a comprehensive overview of machine learning algorithms and various evaluation metrics. Furthermore, we introduce sources of important curated plant genomic data and R packages to enable plant biologists to easily and quickly apply appropriate machine learning algorithms in their research. Finally, we discuss current applications of machine learning algorithms for identifying various genes related to resistance to biotic and abiotic stress. Broad application of machine learning and the accumulation of plant sequencing data will advance plant molecular studies. The advent of high-throughput sequencing technologies has produced several large-scale data sets. This enormous amount of information enables biologists to explore topics that were once difficult or impossible to investigate, such as associations between microRNA and certain diseases, the causes of vascular inflammation and atherosclerosis in humans [1–3] and stress breeding in plants [4]. However, many challenges have also emerged. For example, the European Bioinformatics Institute now stores 273 petabytes of raw molecular data on humans, plants and animals (https://www.ebi.ac.uk/).


On Sharing Models Instead of Data using Mimic learning for Smart Health Applications

arXiv.org Machine Learning

Mohamed Baza, Andrew Salazar †, Mohamed Mahmoud, Mohamed Abdallah ‡, Kemal Akkaya ‡ Department of Computer Science, Tennessee Tech University, Cookeville, TN, USA ‡ Department of Information and Decision Sciences, California State San Bernardino, San Bernardino, CA, USA ‡ Division of Information and Computing Technology, College of Science and Engineering, HBKU, Doha, Qatar § Department of Electrical and Computer Engineering, Florida International University, Miami, FL, USA Abstract: Electronic health records (EHR) systems contain vast amounts of medical information about patients. These data can be used to train machine learning models that can predict health status, as well as to help prevent future diseases or disabilities. However, getting patients' medical data to obtain well-trained machine learning models is a challenging task. This is because sharing the patients' medical records is prohibited by law in most countries due to patient privacy concerns. In this paper, we tackle this problem by sharing the models instead of the original sensitive data by using the mimic learning approach. The idea is first to train a model on the original sensitive data, called the teacher model. Then, using this model, we can transfer its knowledge to another model, called the student model, without the need to learn the original data used in training the teacher model.
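The teacher-student idea can be sketched in a few lines. This is only an illustration under strong assumptions: a nearest-centroid classifier stands in for whatever models the paper actually uses, and the synthetic two-cluster data, sample sizes, and function names are all hypothetical. The point is that the student is fitted only on public points labeled by the teacher and never touches the private records.

```python
import random


def fit_centroids(points, labels):
    """Train a nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, {}
    for (x, y), label in zip(points, labels):
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {c: (sx / counts[c], sy / counts[c]) for c, (sx, sy) in sums.items()}


def predict(centroids, point):
    """Return the class whose centroid is closest to the point."""
    x, y = point
    return min(centroids, key=lambda c: (centroids[c][0] - x) ** 2 + (centroids[c][1] - y) ** 2)


rng = random.Random(0)  # fixed seed so the sketch is reproducible


def sample(center, n):
    """Draw n synthetic 2-D points around a center (hypothetical data)."""
    return [(rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0)) for _ in range(n)]


# Step 1: the teacher is trained on the sensitive data, which stays private.
private_points = sample((0, 0), 200) + sample((4, 4), 200)
private_labels = [0] * 200 + [1] * 200
teacher = fit_centroids(private_points, private_labels)

# Step 2: the teacher labels public, unlabeled data.
public_points = sample((0, 0), 200) + sample((4, 4), 200)
pseudo_labels = [predict(teacher, p) for p in public_points]

# Step 3: the student learns from the public points and the teacher's labels only.
student = fit_centroids(public_points, pseudo_labels)

# How often do teacher and student agree on fresh points?
fresh_points = sample((0, 0), 100) + sample((4, 4), 100)
agreement = sum(predict(teacher, p) == predict(student, p) for p in fresh_points) / len(fresh_points)
print(f"teacher/student agreement: {agreement:.2f}")
```

Only the student (or the teacher's predictions) would need to leave the data owner's environment in this setup; whether that is sufficient for privacy is a separate question that the sketch does not address.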


A Study of the Learnability of Relational Properties (Model Counting Meets Machine Learning)

arXiv.org Artificial Intelligence

Relational properties, e.g., the connectivity structure of nodes in a distributed system, have many applications in software design and analysis. However, such properties often have to be written manually, which can be costly and error-prone. This paper introduces the MCML approach for empirically studying the learnability of a key class of such properties that can be expressed in the well-known software design language Alloy. A key novelty of MCML is quantification of the performance of and semantic differences among trained machine learning (ML) models, specifically decision trees, with respect to entire input spaces (up to a bound on the input size), and not just for given training and test datasets (as is the common practice). MCML reduces the quantification problems to the classic complexity theory problem of model counting, and employs state-of-the-art approximate and exact model counters for high efficiency. The results show that relatively simple ML models can achieve surprisingly high performance (accuracy and F1 score) at learning relational properties when evaluated in the common setting of using training and test datasets -- even when the training dataset is much smaller than the test dataset -- indicating the seeming simplicity of learning these properties. However, the use of MCML metrics based on model counting shows that the performance can degrade substantially when tested against the whole (bounded) input space, indicating the high complexity of precisely learning these properties, and the usefulness of model counting in quantifying the true accuracy.
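The idea of scoring a model against the entire bounded input space can be illustrated with brute-force enumeration. This is a toy stand-in, not the MCML approach: it replaces the model counters with an exhaustive loop (feasible only for tiny bounds), uses symmetry of a relation over 3 nodes as the property, and uses a hypothetical hand-written rule in place of a trained decision tree.

```python
from itertools import product

N = 3  # bound on the number of nodes; the relation is an N x N bit matrix


def is_symmetric(bits):
    """Ground truth: edge (i, j) is present exactly when edge (j, i) is."""
    return all(bits[i * N + j] == bits[j * N + i] for i in range(N) for j in range(i + 1, N))


def toy_model(bits):
    """Hypothetical shallow 'tree' that checks only two of the three node pairs."""
    return bits[0 * N + 1] == bits[1 * N + 0] and bits[0 * N + 2] == bits[2 * N + 0]


def count_outcomes(truth, model):
    """Count TP/FP/TN/FN over every relation of size N (2 ** (N * N) inputs)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for bits in product((0, 1), repeat=N * N):
        expected, predicted = truth(bits), model(bits)
        if predicted:
            counts["tp" if expected else "fp"] += 1
        else:
            counts["fn" if expected else "tn"] += 1
    return counts


counts = count_outcomes(is_symmetric, toy_model)
total = sum(counts.values())
accuracy = (counts["tp"] + counts["tn"]) / total
precision = counts["tp"] / (counts["tp"] + counts["fp"])
recall = counts["tp"] / (counts["tp"] + counts["fn"])
f1 = 2 * precision * recall / (precision + recall)
print(counts, accuracy, precision, recall, f1)
```

Because only 64 of the 512 relations are symmetric, the toy rule never misses a symmetric relation yet is right only half the time when it predicts one. A test set with a different class mix could hide that, which is the kind of gap whole-space counting is meant to expose.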


AI improves breast cancer risk prediction

#artificialintelligence

Most existing breast cancer screening programs are based on mammography at similar time intervals -- typically, annually or every two years -- for all women. This "one size fits all" approach is not optimized for cancer detection on an individual level and may hamper the effectiveness of screening programs. "Risk prediction is an important building block of an individually adapted screening policy," said study lead author Karin Dembrower, M.D., breast radiologist and Ph.D. candidate from the Karolinska Institute in Stockholm, Sweden. "Effective risk prediction can improve attendance and confidence in screening programs." High breast density, or a greater amount of glandular and connective tissue compared to fat, is considered a risk factor for cancer.


Improving drug response prediction by integrating multiple data sources: matrix factorization, kernel and network-based approaches

#artificialintelligence

Note: MF Matrix factorization; BMF Bayesian matrix factorization; KBMF Kernel Bayesian matrix factorization; KRR Kernel ridge regression; NBR Network based regression; NBC Network based classification; CV Cross validation; LOOCV Leave-one-out cross validation; PCC Pearson correlation coefficient; RMSE Root mean square error; MSE Mean square error; SCC Spearman correlation coefficient; NDCG Normalized discounted cumulative gain; R2 Coefficient of determination; NRMSE Normalized root mean squared error; AUC Area under curve; PPI Protein–protein interaction.


EnsemFDet: An Ensemble Approach to Fraud Detection based on Bipartite Graph

arXiv.org Machine Learning

Fraud detection is extremely critical for e-commerce business. It is the intent of the companies to detect and prevent fraud as early as possible. Existing fraud detection methods try to identify unexpected dense subgraphs and treat related nodes as suspicious. Spectral relaxation-based methods solve the problem efficiently but hurt the performance due to the relaxed constraints. Besides, many methods cannot be accelerated with parallel computation or control the number of returned suspicious nodes because they provide a set of subgraphs with diverse node sizes. These drawbacks affect the real-world applications of existing methods. In this paper, we propose an Ensemble-based Fraud Detection (EnsemFDet) method to scale up fraud detection in bipartite graphs by decomposing the original problem into subproblems on small-sized subgraphs. By oversampling the graph and solving the subproblems, the ensemble approach further votes suspicious nodes without sacrificing the prediction accuracy. Extensive experiments have been done on real transaction data from JD.com, which is one of the world's largest e-commerce platforms. Experimental results demonstrate the effectiveness, practicability, and scalability of EnsemFDet. More specifically, EnsemFDet is up to 100x faster than the state-of-the-art methods due to its parallelism with all aspects of data.


Privacy Attacks on Network Embeddings

arXiv.org Machine Learning

Data ownership and data protection are increasingly important topics with ethical and legal implications, e.g., with the right to erasure established in the European General Data Protection Regulation (GDPR). In this light, we investigate network embeddings, i.e., the representation of network nodes as low-dimensional vectors. We consider a typical social network scenario with nodes representing users and edges relationships between them. We assume that a network embedding of the nodes has been trained. After that, a user demands the removal of his data, requiring the full deletion of the corresponding network information, in particular the corresponding node and incident edges. In that setting, we analyze whether after the removal of the node from the network and the deletion of the vector representation of the respective node in the embedding significant information about the link structure of the removed node is still encoded in the embedding vectors of the remaining nodes. This would require a (potentially computationally expensive) retraining of the embedding. For that purpose, we deploy an attack that leverages information from the remaining network and embedding to recover information about the neighbors of the removed node. The attack is based on (i) measuring distance changes in network embeddings and (ii) a machine learning classifier that is trained on networks that are constructed by removing additional nodes. Our experiments demonstrate that substantial information about the edges of a removed node/user can be retrieved across many different datasets. This implies that to fully protect the privacy of users, node deletion requires complete retraining - or at least a significant modification - of original network embeddings. Our results suggest that deleting the corresponding vector representation from network embeddings alone is not sufficient from a privacy perspective.