Accuracy
Graph Based Link Prediction between Human Phenotypes and Genes
Background: The learning of genotype-phenotype associations and history of human disease by doing detailed and precise analysis of phenotypic abnormalities can be defined as deep phenotyping. To understand and detect this interaction between phenotype and genotype is a fundamental step when translating precision medicine to clinical practice. The recent advances in the field of machine learning is efficient to predict these interactions between abnormal human phenotypes and genes. Methods: In this study, we developed a framework to predict links between human phenotype ontology (HPO) and genes. The annotation data from the heterogeneous knowledge resources i.e., orphanet, is used to parse human phenotype-gene associations. To generate the embeddings for the nodes (HPO & genes), an algorithm called node2vec was used. It performs node sampling on this graph based on random walks, then learns features over these sampled nodes to generate embeddings. These embeddings were used to perform the downstream task to predict the presence of the link between these nodes using 5 different supervised machine learning algorithms. Results: The downstream link prediction task shows that the Gradient Boosting Decision Tree based model (LightGBM) achieved an optimal AUROC 0.904 and AUCPR 0.784. In addition, LightGBM achieved an optimal weighted F1 score of 0.87. Compared to the other 4 methods LightGBM is able to find more accurate interaction/link between human phenotype & gene pairs.
Sequential Domain Adaptation by Synthesizing Distributionally Robust Experts
Taskesen, Bahar, Yue, Man-Chung, Blanchet, Jose, Kuhn, Daniel, Nguyen, Viet Anh
Least squares estimators, when trained on a few target domain samples, may predict poorly. Supervised domain adaptation aims to improve the predictive accuracy by exploiting additional labeled training samples from a source distribution that is close to the target distribution. Given available data, we investigate novel strategies to synthesize a family of least squares estimator experts that are robust with regard to moment conditions. When these moment conditions are specified using Kullback-Leibler or Wasserstein-type divergences, we can find the robust estimators efficiently using convex optimization. We use the Bernstein online aggregation algorithm on the proposed family of robust experts to generate predictions for the sequential stream of target test samples. Numerical experiments on real data show that the robust strategies may outperform non-robust interpolations of the empirical least squares estimators.
Fair-Net: A Network Architecture For Reducing Performance Disparity Between Identifiable Sub-Populations
Datta, Arghya, Swamidass, S. Joshua
In real world datasets, particular groups are under-represented, much rarer than others, and machine learning classifiers will often preform worse on under-represented populations. This problem is aggravated across many domains where datasets are class imbalanced, with a minority class far rarer than the majority class. Naive approaches to handle under-representation and class imbalance include training sub-population specific classifiers that handle class imbalance or training a global classifier that overlooks sub-population disparities and aims to achieve high overall accuracy by handling class imbalance. In this study, we find that these approaches are vulnerable in class imbalanced datasets with minority sub-populations. We introduced Fair-Net, a branched multitask neural network architecture that improves both classification accuracy and probability calibration across identifiable sub-populations in class imbalanced datasets. Fair-Nets is a straightforward extension to the output layer and error function of a network, so can be incorporated in far more complex architectures. Empirical studies with three real world benchmark datasets demonstrate that Fair-Net improves classification and calibration performance, substantially reducing performance disparity between gender and racial sub-populations.
Pricing Algorithmic Insurance
Bertsimas, Dimitris, Orfanoudaki, Agni
As machine learning algorithms start to get integrated into the decision-making process of companies and organizations, insurance products will be developed to protect their owners from risk. We introduce the concept of algorithmic insurance and present a quantitative framework to enable the pricing of the derived insurance contracts. We propose an optimization formulation to estimate the risk exposure and price for a binary classification model. Our approach outlines how properties of the model, such as accuracy, interpretability and generalizability, can influence the insurance contract evaluation. To showcase a practical implementation of the proposed framework, we present a case study of medical malpractice in the context of breast cancer detection. Our analysis focuses on measuring the effect of the model parameters on the expected financial loss and identifying the aspects of algorithmic performance that predominantly affect the price of the contract.
Validating GAN-BioBERT: A Methodology For Assessing Reporting Trends In Clinical Trials
Myszewski, Joshua J, Klossowski, Emily, Meyer, Patrick, Bevil, Kristin, Klesius, Lisa, Schroeder, Kristopher M
In the past decade, there has been much discussion about the issue of biased reporting in clinical research. Despite this attention, there have been limited tools developed for the systematic assessment of qualitative statements made in clinical research, with most studies assessing qualitative statements relying on the use of manual expert raters, which limits their size. Also, previous attempts to develop larger scale tools, such as those using natural language processing, were limited by both their accuracy and the number of categories used for the classification of their findings. With these limitations in mind, this study's goal was to develop a classification algorithm that was both suitably accurate and finely grained to be applied on a large scale for assessing the qualitative sentiment expressed in clinical trial abstracts. Additionally, this study seeks to compare the performance of the proposed algorithm, GAN-BioBERT, to previous studies as well as to expert manual rating of clinical trial abstracts. This study develops a three-class sentiment classification algorithm for clinical trial abstracts using a semi-supervised natural language process model based on the Bidirectional Encoder Representation from Transformers (BERT) model, from a series of clinical trial abstracts annotated by a group of experts in academic medicine. Results: The use of this algorithm was found to have a classification accuracy of 91.3%, with a macro F1-Score of 0.92, which is a significant improvement in accuracy when compared to previous methods and expert ratings, while also making the sentiment classification finer grained than previous studies. The proposed algorithm, GAN-BioBERT, is a suitable classification model for the large-scale assessment of qualitative statements in clinical trial literature, providing an accurate, reproducible tool for the large-scale study of clinical publication trends.
Predict Lead Score (the right way) using PyCaret
Leads are the driving force of many businesses today. With the advancement of subscription-based business models particularly in the start-up space, the ability to convert leads into paying customers is key to survival. In simple terms, a "lead" represents a potential customer interested in buying your product/service. A significant amount of time, money, and effort is spent by marketing and sales departments on lead management, a concept that we will take to encompass the three key phases of lead generation, qualification, and monetization. Lead generation is the initiation of customer interest or inquiry into the products or services of your business.
Gradient-based Data Subversion Attack Against Binary Classifiers
Vasu, Rosni K, Seetharaman, Sanjay, Malaviya, Shubham, Shukla, Manish, Lodha, Sachin
Machine learning based data-driven technologies have shown impressive performances in a variety of application domains. Most enterprises use data from multiple sources to provide quality applications. The reliability of the external data sources raises concerns for the security of the machine learning techniques adopted. An attacker can tamper the training or test datasets to subvert the predictions of models generated by these techniques. Data poisoning is one such attack wherein the attacker tries to degrade the performance of a classifier by manipulating the training data. In this work, we focus on label contamination attack in which an attacker poisons the labels of data to compromise the functionality of the system. We develop Gradient-based Data Subversion strategies to achieve model degradation under the assumption that the attacker has limited-knowledge of the victim model. We exploit the gradients of a differentiable convex loss function (residual errors) with respect to the predicted label as a warm-start and formulate different strategies to find a set of data instances to contaminate. Further, we analyze the transferability of attacks and the susceptibility of binary classifiers. Our experiments show that the proposed approach outperforms the baselines and is computationally efficient.
Semi-orthogonal Embedding for Efficient Unsupervised Anomaly Segmentation
Kim, Jin-Hwa, Kim, Do-Hyeong, Yi, Saehoon, Lee, Taehoon
We present the efficiency of semi-orthogonal embedding for unsupervised anomaly segmentation. The multi-scale features from pre-trained CNNs are recently used for the localized Mahalanobis distances with significant performance. However, the increased feature size is problematic to scale up to the bigger CNNs, since it requires the batch-inverse of multi-dimensional covariance tensor. Here, we generalize an ad-hoc method, random feature selection, into semi-orthogonal embedding for robust approximation, cubically reducing the computational cost for the inverse of multi-dimensional covariance tensor. With the scrutiny of ablation studies, the proposed method achieves a new state-of-the-art with significant margins for the MVTec AD, KolektorSDD, KolektorSDD2, and mSTC datasets. The theoretical and empirical analyses offer insights and verification of our straightforward yet cost-effective approach.
Generalization Error Rates in Kernel Regression: The Crossover from the Noiseless to Noisy Regime
Cui, Hugo, Loureiro, Bruno, Krzakala, Florent, Zdeborová, Lenka
Kernel methods are among the most popular models in machine learning. Despite their relative simplicity, they define a powerful framework in which non-linear features can be exploited without leaving the realm of convex optimisation. Kernel methods in machine learning have a long and rich literature dating back to the 60s [1, 2], but have recently made it back to the spotlight as a proxy for studying neural networks in different regimes, e.g. the infinite width limit [3-6] and the lazy regime of training [7]. Despite being defined in terms of a non-parametric optimisation problem, kernel methods can be mathematically understood as a standard parametric linear problem in a (possibly infinite) Hilbert space spanned by the kernel eigenvectors (a.k.a features). This dual picture fully characterizes the asymptotic performance of kernels in terms of a trade-off between two key quantities: the relative decay of the eigenvalues of the kernel (a.k.a.
Picking Pearl From Seabed: Extracting Artefacts from Noisy Issue Triaging Collaborative Conversations for Hybrid Cloud Services
Azad, Amar Prakash, Ghosh, Supriyo, Gupta, Ajay, Kumar, Harshit, Mohapatra, Prateeti
Site Reliability Engineers (SREs) play a key role in issue identification and resolution. After an issue is reported, SREs come together in a virtual room (collaboration platform) to triage the issue. While doing so, they leave behind a wealth of information which can be used later for triaging similar issues. However, usability of the conversations offer challenges due to them being i) noisy and ii) unlabelled. This paper presents a novel approach for issue artefact extraction from the noisy conversations with minimal labelled data. We propose a combination of unsupervised and supervised model with minimum human intervention that leverages domain knowledge to predict artefacts for a small amount of conversation data and use that for fine-tuning an already pretrained language model for artefact prediction on a large amount of conversation data. Experimental results on our dataset show that the proposed ensemble of unsupervised and supervised model is better than using either one of them individually.