Accuracy
Association of Pathological Fibrosis With Renal Survival Using Deep Neural Networks
Sets of bars within the histogram were colored according to the Kidney Disease Outcomes Quality Initiative (KDOQI) guideline-driven cutoff values for high and low creatinine. Model predictions were made on the remaining 30% of the data (n = 662), and a receiver operating characteristic (ROC) curve was generated. Sets of bars within the histogram were colored according to the KDOQI guideline-driven cutoff value for nephrotic-range proteinuria (g/d). Model predictions were made on the remaining 30% of the data (n = 648), and an ROC curve was generated.
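As a rough illustration of how an ROC curve is obtained from held-out predictions, the sketch below sweeps a decision threshold over the predicted scores and integrates the curve with the trapezoid rule. It is not the paper's pipeline: the function names `roc_points` and `auc` and the toy labels and scores are assumptions made for this example.

```python
def roc_points(labels, scores):
    """Return ROC points as (false positive rate, true positive rate) pairs.

    labels: list of 0/1 ground-truth values (1 = positive class).
    scores: list of predicted scores, higher meaning "more likely positive".
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    # Visit examples from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    points = [(0.0, 0.0)]
    tp = fp = 0
    prev = None
    for i in order:
        # Emit a point only when the threshold moves past a group of tied scores.
        if prev is not None and scores[i] != prev:
            points.append((fp / neg, tp / pos))
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        prev = scores[i]
    points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve, by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy held-out set (hypothetical values, not data from the study).
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(roc_points(toy_labels, toy_scores)))
```

The area equals the fraction of positive-negative pairs that the scores rank correctly, which gives a quick sanity check on small inputs.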
Convex Formulations for Fair Principal Component Analysis
Though there is a growing body of literature on fairness for supervised learning, the problem of incorporating fairness into unsupervised learning has been less well studied. This paper studies fairness in the context of principal component analysis (PCA). We first present a definition of fairness for dimensionality reduction: a reduction is fair if information about a protected class (e.g., race or gender) cannot be inferred from the dimensionality-reduced data points. Next, we develop convex optimization formulations that can improve the fairness (with respect to our definition) of PCA and kernel PCA. These formulations are semidefinite programs (SDPs), and we demonstrate their effectiveness using several datasets. We conclude by showing how our approach can be used to perform a fair (with respect to age) clustering of health data that may be used to set health insurance rates.
PCA-Based Missing Information Imputation for Real-Time Crash Likelihood Prediction Under Imbalanced Data
Ke, Jintao, Zhang, Shuaichao, Yang, Hai, Chen, Xiqun
Real-time crash likelihood prediction has been an important research topic. Various classifiers, such as support vector machines (SVM) and tree-based boosting algorithms, have been proposed in traffic safety studies. However, little research focuses on missing data imputation in real-time crash likelihood prediction, although missing values are commonly observed due to sensor breakdowns or external interference. In addition, classifying imbalanced data is a difficult problem in real-time crash likelihood prediction, since it is hard to distinguish crash-prone cases from the non-crash cases that make up the majority of the observed samples. In this paper, principal component analysis (PCA) based approaches, including LS-PCA, PPCA, and VBPCA, are employed for imputing missing values, while two kinds of solutions are developed to address the imbalanced-data problem. The results show that PPCA and VBPCA not only outperform LS-PCA and other imputation methods (including mean imputation and k-means clustering imputation) in terms of the root mean square error (RMSE), but also help the classifiers achieve better predictive performance. The two solutions, i.e., cost-sensitive learning and synthetic minority oversampling technique (SMOTE), help improve the sensitivity by adjusting the classifiers to
Corresponding author. Email address: chenxiqun@zju.edu.cn
Keywords: Real-time crash likelihood prediction, PCA-based missing data imputation, cost-sensitive learning, SMOTE, support vector machine, AdaBoost
1. Introduction
Prediction of traffic crashes has been a major research topic in transportation safety studies. Crashes, especially on urban expressways, can trigger heavy traffic congestion, impose huge external costs, and reduce the level of service of transportation infrastructure. Therefore, accurate and reliable prediction of crash risks is critical to the success of proactive safety management strategies on urban expressways.
There have been fruitful studies in the domain of real-time crash likelihood estimation (Abdel-Aty and Pemmanaboina, 2006; Abdel-Aty et al., 2007, 2008; Ahmed and Abdel-Aty, 2012). It has been reported that crash occurrence is affected by four major factors: real-time traffic state, drivers' behavior, environmental factors, and road geometry (Ahmed and Abdel-Aty, 2013b).
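The abstract above names SMOTE as one remedy for class imbalance. A minimal sketch of the core idea follows: each synthetic minority sample is an interpolation between a minority sample and one of its nearest minority neighbors. The function name `smote`, its parameters `n_new`, `k`, and `seed`, and the toy points are assumptions for illustration, not the authors' implementation.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by neighbor interpolation.

    minority: list of equal-length numeric tuples (the minority class).
    k: number of nearest minority neighbors considered for each base sample.
    seed: seed for the private random generator, for reproducibility.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared Euclidean distance to base.
        others = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(base, minority[j])),
        )
        neighbor = minority[rng.choice(others[:k])]
        # Pick a random point on the segment between base and the neighbor.
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic


# Toy minority class (hypothetical crash-prone cases in a 2-D feature space).
crash_cases = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_cases = smote(crash_cases, n_new=10)
print(len(new_cases))
```

Because every synthetic point is a convex combination of two real minority samples, it always lies on the segment joining them, so the oversampled class stays within the region already occupied by the minority data.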
Recovering Loss to Followup Information Using Denoising Autoencoders
Imagine this scenario: in a clinical trial investigating the toxicity of a new chemotherapy drug to treat breast cancer, some patients drop out of the trial before completion for various reasons, so we do not have final outcome data for those patients. What if the patients who drop out are the ones who experienced toxicity and are unwilling to continue the treatment? This reason, however, is not recorded in the database, and the patients are simply marked as "lost to followup". If the investigators were to analyze the data using conventional methods in which loss to followup is ignored and not properly accounted for, they would estimate the toxicity to be far lower than it really is. These results can lead to adopting a drug that is in fact unsafe. Similarly, if patients who are feeling better drop out of the trial before completion, the estimates of toxicity would be far greater than the real value, leading to the rejection of a potentially lifesaving drug.
Critères de qualité d'un classifieur généraliste
This paper considers the problem of choosing a good classifier. For each problem there exists an optimal classifier, but no classifier is optimal, with respect to the error rate, in all cases. Because there is a large number of classifiers, a user would prefer an all-purpose classifier that is easy to adjust, in the hope that it will perform almost as well as the optimal one. In this paper we establish a list of criteria that a good generalist classifier should satisfy. We first discuss data analytics; these criteria are then presented. Six of the most popular classifiers are selected and scored according to these criteria. Tables make it easy to compare the relative merits of each. In the end, random forests turn out to be the best classifiers.
Enhanced version of AdaBoostM1 with J48 Tree learning method
Kang, Kyongche, Michalak, Jack
Machine learning focuses on the construction and study of systems that can learn from data. This is connected with the classification problem, which is usually what machine learning algorithms are designed to solve. When a machine learning method is used by people with no special expertise in machine learning, it is important that the method be 'robust' in classification, in the sense that reasonable performance is obtained with minimal tuning for the problem at hand. Algorithms are evaluated based on how 'robustly' they can classify the given data. In this paper, we propose a quantifiable measure of 'robustness' and describe a particular learning method that is robust according to this measure in the context of the classification problem. We propose Adaptive Boosting (AdaBoostM1) with J48 (C4.5 tree) as the base learner, tuning the weight threshold (P) and the number of iterations (I) of the boosting algorithm. To benchmark the performance, we use a baseline classifier, AdaBoostM1 with Decision Stump as the base learner and no parameter tuning. By tuning parameters and using J48 as the base learner, we are able to reduce the overall average error rate ratio (errorC/errorNB) from 2.4 to 0.9 on the development data sets and from 2.1 to 1.2 on the evaluation data sets.
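For readers unfamiliar with the boosting step that the abstract above builds on, the sketch below shows a single reweighting round of AdaBoost.M1 as defined by Freund and Schapire. It abstracts the base learner (J48 or Decision Stump in the paper) to a list of per-example correct/incorrect flags and does not reproduce the authors' tuning of P and I. The function name `adaboost_m1_round` is an assumption for this example.

```python
import math


def adaboost_m1_round(weights, correct):
    """Perform one AdaBoost.M1 reweighting round.

    weights: current example weights, assumed to sum to 1.
    correct: list of booleans, True where the base learner was right.
    Returns (new_weights, vote), where vote is the base learner's voting weight.
    """
    # Weighted error of the base learner on the current distribution.
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0 < error < 0.5:
        # AdaBoost.M1 stops boosting when the error is 0 or at least 1/2.
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    beta = error / (1 - error)
    # Shrink the weights of correctly classified examples, then renormalize.
    updated = [w * beta if ok else w for w, ok in zip(weights, correct)]
    total = sum(updated)
    new_weights = [w / total for w in updated]
    vote = math.log(1 / beta)
    return new_weights, vote


# Four equally weighted examples, with the last one misclassified (toy values).
new_w, vote = adaboost_m1_round([0.25, 0.25, 0.25, 0.25],
                                [True, True, True, False])
print(new_w, vote)
```

After the update, the misclassified examples carry exactly half of the total weight, which forces the next base learner to concentrate on them.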
Adversarial Metric Learning
Chen, Shuo, Gong, Chen, Yang, Jian, Li, Xiang, Wei, Yang, Li, Jun
In the past decades, intensive efforts have been devoted to designing various loss functions and metric forms for the metric learning problem. These improvements have shown promising results when the test data is similar to the training data. However, the trained models often fail to produce reliable distances on ambiguous test pairs due to the distribution bias between the training set and the test set. To address this problem, Adversarial Metric Learning (AML) is proposed in this paper, which automatically generates adversarial pairs to remedy the distribution bias and facilitate robust metric learning. Specifically, AML consists of two adversarial stages, i.e., confusion and distinguishment. In the confusion stage, ambiguous but critical adversarial data pairs are adaptively generated to mislead the learned metric. In the distinguishment stage, a metric is exhaustively learned to distinguish, as well as it can, both the adversarial pairs and the original training pairs. Thanks to the challenges posed by the confusion stage in this competing process, the AML model is able to capture difficult knowledge that is not contained in the original training pairs, so the discriminability of AML can be significantly improved. The entire model is formulated as an optimization framework whose global convergence is theoretically proved. The experimental results on toy data and practical datasets clearly demonstrate the superiority of AML over representative state-of-the-art metric learning methodologies.
Brain EEG Time Series Selection: A Novel Graph-Based Approach for Classification
Dai, Chenglong, Wu, Jia, Pi, Dechang, Cui, Lin
Brain electroencephalography (EEG) classification has been widely applied to analyze cerebral diseases in recent years. Unfortunately, invalid or noisy EEGs degrade diagnostic performance, and most previously developed methods ignore the necessity of EEG selection for classification. To this end, this paper proposes a novel maximum weight clique-based EEG selection approach, named mwcEEGs, which maps EEG selection to searching for maximum similarity-weighted cliques in an improved Fréchet distance-weighted undirected EEG graph, simultaneously considering edge weights and vertex weights. Our mwcEEGs improves classification performance by selecting intra-clique pairwise similar and inter-clique discriminative EEGs with similarity threshold $\delta$. Experimental results demonstrate the algorithm's effectiveness compared with state-of-the-art time series selection algorithms on real-world EEG datasets.
Concept Drift and Anomaly Detection in Graph Streams
Zambon, Daniele, Alippi, Cesare, Livi, Lorenzo
Graph representations offer powerful and intuitive ways to describe data in a multitude of application domains. Here, we consider stochastic processes generating graphs and propose a methodology for detecting changes in stationarity of such processes. The methodology is general and considers a process generating attributed graphs with a variable number of vertices/edges, without the need to assume one-to-one correspondence between vertices at different time steps. The methodology acts by embedding every graph of the stream into a vector domain, where a conventional multivariate change detection procedure can be easily applied. We ground the soundness of our proposal by proving several theoretical results. In addition, we provide a specific implementation of the methodology and evaluate its effectiveness on several detection problems involving attributed graphs representing biological molecules and drawings. Experimental results are contrasted with respect to suitable baseline methods, demonstrating the effectiveness of our approach.
Relational Autoencoder for Feature Extraction
Meng, Qinxue, Catchpoole, Daniel, Skillicorn, David, Kennedy, Paul J.
Feature extraction becomes increasingly important as data grows high dimensional. The autoencoder, a neural-network-based feature extraction method, has achieved great success in generating abstract features of high-dimensional data. However, it fails to consider the relationships among data samples, which may affect experimental results when using original and new features. In this paper, we propose a Relational Autoencoder model that considers both data features and their relationships. We also extend it to work with other major autoencoder models, including Sparse Autoencoder, Denoising Autoencoder and Variational Autoencoder. The proposed relational autoencoder models are evaluated on a set of benchmark datasets, and the experimental results show that considering data relationships can generate more robust features, which achieve lower construction loss and, in turn, a lower error rate in subsequent classification compared to the other autoencoder variants.