Performance Analysis
Multi-Kernel LS-SVM Based Bio-Clinical Data Integration: Applications to Ovarian Cancer
The medical research facilitates to acquire a diverse type of data from the same individual for particular cancer. Recent studies show that utilizing such diverse data results in more accurate predictions. The major challenge faced is how to utilize such diverse data sets in an effective way. In this paper, we introduce a multiple kernel based pipeline for integrative analysis of high-throughput molecular data (somatic mutation, copy number alteration, DNA methylation and mRNA) and clinical data. We apply the pipeline on Ovarian cancer data from TCGA. After multiple kernels have been generated from the weighted sum of individual kernels, it is used to stratify patients and predict clinical outcomes. We examine the survival time, vital status, and neoplasm cancer status of each subtype to verify how well they cluster. We have also examined the power of molecular and clinical data in predicting dichotomized overall survival data and to classify the tumor grade for the cancer samples. It was observed that the integration of various data types yields higher log-rank statistics value. We were also able to predict clinical status with higher accuracy as compared to using individual data types.
False Positives Are a True Negative: Using Machine Learning to Improve Accuracy
Machine learning has grown to be one of the most popular and powerful tools in the quest to secure systems. Some approaches to machine learning have yielded overly aggressive models that demonstrate remarkable predictive accuracy, yet give way to false positives. False positives create negative user experiences that prevent new protection from deploying. IT personnel also find these false alarms disruptive when they are working to detect and eliminate malware. The Ponemon Institute recently reported that over 20 percent of endpoint security investigation spending was wasted on these false alarms.
Lagged Exact Bayesian Online Changepoint Detection
Byrd, Michael, Nghiem, Linh, Cao, Jing
Identifying changes in the generative process of sequential data, known as changepoint detection, has become an increasingly important topic for a wide variety of fields. A recently developed approach, which we call EXact Online Bayesian Changepoint Detection (EXO), has shown reasonable results with efficient computation for real time updates. However, when the changes are relatively small, EXO starts to have difficulty in detecting changepoints accurately. We propose a new algorithm called $\ell$-Lag EXact Online Bayesian Changepoint Detection (LEXO-$\ell$), which improves the accuracy of the detection by incorporating $\ell$ time lags in the inference. We prove that LEXO-1 finds the exact posterior distribution for the current run length and can be computed efficiently, with extension to arbitrary lag. Additionally, we show that LEXO-1 performs better than EXO in an extensive simulation study; this study is extended to higher order lags to illustrate the performance of the generalized methodology. Lastly, we illustrate applicability with two real world data examples comparing EXO and LEXO-1.
On formalizing fairness in prediction with machine learning
Machine learning algorithms for prediction are increasingly being used in critical decisions affecting human lives. Various fairness formalizations, with no firm consensus yet, are employed to prevent such algorithms from systematically discriminating against people based on certain attributes protected by law. The aim of this article is to survey how fairness is formalized in the machine learning literature for the task of prediction and present these formalizations with their corresponding notions of distributive justice from the social sciences literature. We provide theoretical as well as empirical critiques of these notions from the social sciences literature and explain how these critiques limit the suitability of the corresponding fairness formalizations to certain domains. We also suggest two notions of distributive justice which address some of these critiques and discuss avenues for prospective fairness formalizations.
Considerations of automated machine learning in clinical metabolic profiling: Altered homocysteine plasma concentration associated with metformin exposure
Orlenko, Alena, Moore, Jason H., Orzechowski, Patryk, Olson, Randal S., Cairns, Junmei, Caraballo, Pedro J., Weinshilboum, Richard M., Wang, Liewei, Breitenstein, Matthew K.
With the maturation of metabolomics science and proliferation of biobanks, clinical metabolic profiling is an increasingly opportunistic frontier for advancing translational clinical research. Automated Machine Learning (AutoML) approaches provide exciting opportunity to guide feature selection in agnostic metabolic profiling endeavors, where potentially thousands of independent data points must be evaluated. In previous research, AutoML using high-dimensional data of varying types has been demonstrably robust, outperforming traditional approaches. However, considerations for application in clinical metabolic profiling remain to be evaluated. Particularly, regarding the robustness of AutoML to identify and adjust for common clinical confounders. In this study, we present a focused case study regarding AutoML considerations for using the Tree-Based Optimization Tool (TPOT) in metabolic profiling of exposure to metformin in a biobank cohort. First, we propose a tandem rank-accuracy measure to guide agnostic feature selection and corresponding threshold determination in clinical metabolic profiling endeavors. Second, while AutoML, using default parameters, demonstrated potential to lack sensitivity to low-effect confounding clinical covariates, we demonstrated residual training and adjustment of metabolite features as an easily applicable approach to ensure AutoML adjustment for potential confounding characteristics. Finally, we present increased homocysteine with long-term exposure to metformin as a potentially novel, non-replicated metabolite association suggested by TPOT; an association not identified in parallel clinical metabolic profiling endeavors. While considerations are recommended, including adjustment approaches for clinical confounders, AutoML presents an exciting tool to enhance clinical metabolic profiling and advance translational research endeavors.
Structural Feature Selection for Event Logs
Hinkka, Markku, Lehto, Teemu, Heljanko, Keijo, Jung, Alexander
We consider the problem of classifying business process instances based on structural features derived from event logs. The main motivation is to provide machine learning based techniques with quick response times for interactive computer assisted root cause analysis. In particular, we create structural features from process mining such as activity and transition occurrence counts, and ordering of activities to be evaluated as potential features for classification. We show that adding such structural features increases the amount of information thus potentially increasing classification accuracy. However, there is an inherent trade-off as using too many features leads to too long run-times for machine learning classification models. One way to improve the machine learning algorithms' run-time is to only select a small number of features by a feature selection algorithm. However, the run-time required by the feature selection algorithm must also be taken into account. Also, the classification accuracy should not suffer too much from the feature selection. The main contributions of this paper are as follows: First, we propose and compare six different feature selection algorithms by means of an experimental setup comparing their classification accuracy and achievable response times. Second, we discuss the potential use of feature selection results for computer assisted root cause analysis as well as the properties of different types of structural features in the context of feature selection.
Machine Unlearning: The Value of Imperfect Models
A project manager once told me that "any job worth doing is worth doing poorly." I understood exactly what she meant, and she knew that I would understand, especially when she preceded our conversation with these words: "I wouldn't say this to everyone, but I know you will understand what I mean." The message was clear to me because I was a perfectionist (and hopefully I have learned over the years to be less of a perfectionist thanks to my project manager's wise counsel). As a perfectionist, I would strive for 100% completion and perfection on every project, every analysis, and every report. It would take me longer than most people to finish the analysis and report, and my manager understood why.
A Large Self-Annotated Corpus for Sarcasm
Khodak, Mikhail, Saunshi, Nikunj, Vodrahalli, Kiran
We introduce the Self-Annotated Reddit Corpus (SARC), a large corpus for sarcasm research and for training and evaluating systems for sarcasm detection. The corpus has 1.3 million sarcastic statements -- 10 times more than any previous dataset -- and many times more instances of non-sarcastic statements, allowing for learning in regimes of both balanced and unbalanced labels. Each statement is furthermore self-annotated -- sarcasm is labeled by the author and not an independent annotator -- and provided with user, topic, and conversation context. We evaluate the corpus for accuracy, compare it to previous related corpora, and provide baselines for the task of sarcasm detection.
Anatomical Pattern Analysis for decoding visual stimuli in human brains
Yousefnezhad, Muhammad, Zhang, Daoqiang
Background: A universal unanswered question in neuroscience and machine learning is whether computers can decode the patterns of the human brain. Multi-Voxels Pattern Analysis (MVPA) is a critical tool for addressing this question. However, there are two challenges in the previous MVPA methods, which include decreasing sparsity and noise in the extracted features and increasing the performance of prediction. Methods: In overcoming mentioned challenges, this paper proposes Anatomical Pattern Analysis (APA) for decoding visual stimuli in the human brain. This framework develops a novel anatomical feature extraction method and a new imbalance AdaBoost algorithm for binary classification. Further, it utilizes an Error-Correcting Output Codes (ECOC) method for multiclass prediction. APA can automatically detect active regions for each category of the visual stimuli. Moreover, it enables us to combine homogeneous datasets for applying advanced classification. Results and Conclusions: Experimental studies on 4 visual categories (words, consonants, objects and scrambled photos) demonstrate that the proposed approach achieves superior performance to state-of-the-art methods.