 Accuracy


Privacy Risk in Machine Learning: Analyzing the Connection to Overfitting

arXiv.org Machine Learning

Machine learning algorithms, when applied to sensitive data, pose a distinct threat to privacy. A growing body of prior work demonstrates that models produced by these algorithms may leak specific private information in the training data to an attacker, either through the models' structure or their observable behavior. However, the underlying cause of this privacy risk is not well understood beyond a handful of anecdotal accounts that suggest overfitting and influence might play a role. This paper examines the effect that overfitting and influence have on the ability of an attacker to learn information about the training data from machine learning models, either through training set membership inference or attribute inference attacks. Using both formal and empirical analyses, we illustrate a clear relationship between these factors and the privacy risk that arises in several popular machine learning algorithms. We find that overfitting is sufficient to allow an attacker to perform membership inference and, when the target attribute meets certain conditions about its influence, attribute inference attacks. Interestingly, our formal analysis also shows that overfitting is not necessary for these attacks and begins to shed light on what other factors may be in play. Finally, we explore the connection between membership inference and attribute inference, showing that there are deep connections between the two that lead to effective new attacks.
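To make the link between overfitting and membership inference concrete, here is a minimal sketch of a loss-threshold attack. It assumes the attacker can observe the model's loss on a record and knows the average training loss; the per-record losses, the function names, and the choice of threshold are hypothetical and are not taken from the abstract.

```python
def membership_guess(loss, threshold):
    """Guess 'member' when the model's loss on a record is at most the threshold."""
    return loss <= threshold


def membership_advantage(train_losses, test_losses, threshold):
    """True-positive rate minus false-positive rate of the threshold attack."""
    tpr = sum(membership_guess(l, threshold) for l in train_losses) / len(train_losses)
    fpr = sum(membership_guess(l, threshold) for l in test_losses) / len(test_losses)
    return tpr - fpr


# Hypothetical per-record losses: an overfit model fits its training data
# far better than unseen data, which is what the attack exploits.
train_losses = [0.05, 0.10, 0.02, 0.20]
test_losses = [0.90, 0.15, 1.20, 0.70]

# Assumed threshold: the average training loss.
threshold = sum(train_losses) / len(train_losses)
advantage = membership_advantage(train_losses, test_losses, threshold)
print(advantage)
```

An advantage of 0 means the attacker does no better than guessing; the wider the gap between training and test loss, the larger the advantage this kind of attack can reach.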


5 Questions to Ask about Machine Learning

#artificialintelligence

How tired are we of "artificial intelligence" and "machine learning" being sprinkled like pixie dust on every product being hawked by vendors? The challenge for cybersecurity professionals is to see through the fog and figure out what's real and what's just marketing hyperbole. Often, marketing hyperbole exceeds the reality. Notoriously, Tesla's Autopilot sensors can be fooled in certain edge conditions, the iPhone X can be fooled into unlocking by a doppelganger, and Apple's Siri isn't very good at taking directions. Even the winning team in the DARPA Cyber Grand Challenge lost spectacularly to actual hackers at the DEFCON conference following its win against other machines at Black Hat.


Let's chat about chatbots

#artificialintelligence

A chatbot is a computer program that uses natural language processing (NLP) and artificial intelligence to simulate human conversation and derive a response. Essentially, it's a machine that can chat with you or respond to your chatter. Chatbots can save time and money when used to handle simple, automated tasks. Bots were hot in 2017 but many bots are still primitive. These assistants incorporate more sophisticated NLP and deeper AI to infer a better response.


How to solve 90% of NLP problems: A step-by-step guide

#artificialintelligence

This post was originally published on Insight Data Science; it is republished here with permission. Whether you are an established company or working to launch a new service, you can always leverage text data to validate, improve, and expand the functionalities of your product. The science of extracting meaning and learning from text data is an active topic of research called natural language processing (NLP). NLP produces new and exciting results on a daily basis, and is a very large field.


Modeling polypharmacy side effects with graph convolutional networks

arXiv.org Machine Learning

The use of multiple drugs, termed polypharmacy, is common to treat patients with complex diseases or co-existing medical conditions. However, a major consequence of polypharmacy is a much higher risk of side effects for the patient. Polypharmacy side effects emerge because of drug interactions, in which activity of one drug may change, favorably or unfavorably, if taken with another drug. The knowledge of drug interactions is limited because these complex relationships are usually not observed in small clinical testing. Discovering polypharmacy side effects thus remains a challenge with significant implications for patient mortality and morbidity. Here we introduce Decagon, an approach for modeling polypharmacy side effects. The approach constructs a multimodal graph of protein-protein interactions, drug-protein interactions, and the polypharmacy side effects, which are represented as drug-drug interactions, where each side effect is an edge of a different type. Decagon is developed specifically to handle such multimodal graphs with a large number of edge types. Our approach develops a new graph convolutional neural network for multirelational link prediction in multimodal networks. Unlike approaches limited to predicting simple drug-drug interaction values, Decagon can predict the exact side effect, if any, through which a given drug combination manifests clinically. Decagon accurately predicts polypharmacy side effects, outperforming baselines by up to 69%. Furthermore, Decagon models particularly well side effects that have a strong molecular basis, while on predominantly non-molecular side effects, it achieves good performance because of effective sharing of model parameters across edge types. Decagon creates an opportunity to use large molecular and patient population data to flag and prioritize polypharmacy side effects for follow-up analysis via formal pharmacological studies.


Onto2Vec: joint vector-based representation of biological entities and their ontology-based annotations

arXiv.org Artificial Intelligence

Motivation: Biological knowledge is widely represented in the form of ontology-based annotations: ontologies describe the phenomena assumed to exist within a domain, and the annotations associate a (kind of) biological entity with a set of phenomena within the domain. The structure and information contained in ontologies and their annotations makes them valuable for developing machine learning, data analysis and knowledge extraction algorithms; notably, semantic similarity is widely used to identify relations between biological entities, and ontology-based annotations are frequently used as features in machine learning applications. Results: We propose the Onto2Vec method, an approach to learn feature vectors for biological entities based on their annotations to biomedical ontologies. Our method can be applied to a wide range of bioinformatics research problems such as similarity-based prediction of interactions between proteins, classification of interaction types using supervised learning, or clustering. To evaluate Onto2Vec, we use the Gene Ontology (GO) and jointly produce dense vector representations of proteins, the GO classes to which they are annotated, and the axioms in GO that constrain these classes.


Nonparametric Quantile-Based Causal Discovery

arXiv.org Machine Learning

Telling cause from effect using observational data is a challenging problem, especially in the bivariate case. Contemporary methods often assume an independence between the cause and the generating mechanism of the effect given the cause. From this postulate, they derive asymmetries to uncover causal relationships. In this work, we propose such an approach, based on the link between Kolmogorov complexity and quantile scoring. We use a nonparametric conditional quantile estimator based on copulas to implement our procedure, thus avoiding restrictive assumptions about the joint distribution between cause and effect. In an extensive study on real and synthetic data, we show that quantile copula causal discovery (QCCD) compares favorably to state-of-the-art methods, while at the same time being computationally efficient and scalable.
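The quantile scoring mentioned above is commonly measured with the pinball (quantile) loss. The sketch below shows only that scoring rule; how QCCD combines such scores across quantile levels and causal directions is not stated in the abstract and is not reproduced here. Function names and sample values are hypothetical.

```python
def pinball_loss(y, q, tau):
    """Pinball loss of predicting quantile value q at level tau for outcome y."""
    diff = y - q
    # Under-prediction is weighted by tau, over-prediction by (1 - tau).
    return tau * diff if diff >= 0 else (tau - 1) * diff


def quantile_score(ys, qs, tau):
    """Average pinball loss over paired outcomes and quantile predictions."""
    return sum(pinball_loss(y, q, tau) for y, q in zip(ys, qs)) / len(ys)


# Hypothetical outcomes and predicted medians (tau = 0.5).
outcomes = [3.0, 1.0]
predicted_medians = [1.0, 3.0]
print(quantile_score(outcomes, predicted_medians, 0.5))
```

A lower score means the conditional quantile estimates fit the data better, which is what lets such a score serve as a proxy for how simply one variable is described given the other.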


Deep Multi-view Learning to Rank

arXiv.org Machine Learning

We study the problem of learning to rank from multiple sources. Though multi-view learning and learning to rank have been studied extensively leading to a wide range of applications, multi-view learning to rank as a synergy of both topics has received little attention. The aim of the paper is to propose a composite ranking method while keeping a close correlation with the individual rankings simultaneously. We propose a multi-objective solution to ranking by capturing the information of the feature mapping from both within each view as well as across views using autoencoder-like networks. Moreover, a novel end-to-end solution is introduced to enhance the joint ranking with minimum view-specific ranking loss, so that we can achieve the maximum global view agreements within a single optimization process. The proposed method is validated on a wide variety of ranking problems, including university ranking, multi-view lingual text ranking and image data ranking, providing superior results. Learning to rank is an important research topic in information retrieval and data mining, which aims to learn a ranking model to produce a query-specific ranking list. The ranking model establishes a relationship between each pair of data samples by combining the corresponding features in an optimal way [1]. A score is then assigned to each pair to evaluate its relevance forming a global ranking list across all pairs. The success of learning to rank solutions has brought a wide spectrum of applications, including online advertising [2], natural language processing [3] and multimedia retrieval [4]. Learning appropriate data representation and a suitable scoring function are two vital steps in the ranking problem. Traditionally, a feature mapping models the data distribution in a latent space to match the relevance relationship, while the scoring function is used to quantify the relevance measure [1]; however, the ranking problem in the real world emerges from multiple facets and data patterns are mined from diverse domains.


DxNAT - Deep Neural Networks for Explaining Non-Recurring Traffic Congestion

arXiv.org Machine Learning

Non-recurring traffic congestion is caused by temporary disruptions, such as accidents, sports games, adverse weather, etc. We use data related to real-time traffic speed, jam factors (a traffic congestion indicator), and events collected over a year from Nashville, TN to train a multi-layered deep neural network. The traffic dataset contains over 900 million data records. The network is thereafter used to classify the real-time data and identify anomalous operations. Compared with traditional approaches of using statistical or machine learning techniques, our model reaches an accuracy of 98.73 percent when identifying traffic congestion caused by football games. Our approach first encodes the traffic across a region as a scaled image. After that the image data from different timestamps is fused with event- and time-related data. Then a crossover operator is used as a data augmentation method to generate training datasets with more balanced classes. Finally, we use the receiver operating characteristic (ROC) analysis to tune the sensitivity of the classifier. We present the analysis of the training time and the inference time separately.
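As an illustration of the final step, tuning classifier sensitivity with ROC analysis, the sketch below computes ROC operating points from classifier scores and picks a decision threshold. The selection rule (maximizing true-positive rate minus false-positive rate, known as Youden's J) is an assumption made for this example; the abstract does not say which criterion the authors use. The scores and labels are hypothetical.

```python
def roc_points(scores, labels):
    """Return (threshold, TPR, FPR) for each distinct score used as a cutoff."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        # An example is predicted positive when its score is at least t.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((t, tp / pos, fp / neg))
    return points


def best_threshold(scores, labels):
    """Pick the cutoff maximizing TPR - FPR (Youden's J, an assumed criterion)."""
    return max(roc_points(scores, labels), key=lambda p: p[1] - p[2])[0]


# Hypothetical classifier scores and ground truth (1 = event-related congestion).
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 1, 0, 0]
chosen = best_threshold(scores, labels)
print(chosen)
```

Lowering the threshold below the chosen value trades more false alarms for higher sensitivity, so a different criterion may suit applications where missed events are costlier than false alarms.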


Tournament Leave-pair-out Cross-validation for Receiver Operating Characteristic (ROC) Analysis

arXiv.org Machine Learning

Receiver operating characteristic (ROC) analysis is widely used for evaluating diagnostic systems. Recent studies have shown that estimating an area under ROC curve (AUC) with standard cross-validation methods suffers from a large bias. The leave-pair-out (LPO) cross-validation has been shown to correct this bias. However, while LPO produces an almost unbiased estimate of AUC, it does not provide a ranking of the data needed for plotting and analyzing the ROC curve. In this study, we propose a new method called tournament leave-pair-out (TLPO) cross-validation. This method extends LPO by creating a tournament from pair comparisons to produce a ranking for the data. TLPO preserves the advantage of LPO for estimating AUC, while it also allows performing ROC analysis. We have shown using both synthetic and real world data that TLPO is as reliable as LPO for AUC estimation and confirmed the bias in leave-one-out cross-validation on low-dimensional data.
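The tournament idea can be sketched as follows: every pair of examples is held out in turn, a model is trained on the rest, and the held-out example with the higher predicted score wins that comparison; the win counts then rank the data. This is an illustrative sketch, not the authors' implementation. The one-dimensional class-mean scorer, the half-win rule for ties, and the toy data are all assumptions made for the example.

```python
from itertools import combinations


def fit_centroids(xs, ys):
    """Toy learner (assumed): store the mean of each class."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return sum(pos) / len(pos), sum(neg) / len(neg)


def score(model, x):
    """Higher when x is closer to the positive mean than to the negative mean."""
    mean_pos, mean_neg = model
    return abs(x - mean_neg) - abs(x - mean_pos)


def tlpo_wins(xs, ys):
    """Hold out each pair, train on the rest, and count pairwise wins."""
    wins = [0.0] * len(xs)
    for i, j in combinations(range(len(xs)), 2):
        train_x = [x for k, x in enumerate(xs) if k not in (i, j)]
        train_y = [y for k, y in enumerate(ys) if k not in (i, j)]
        model = fit_centroids(train_x, train_y)
        si, sj = score(model, xs[i]), score(model, xs[j])
        if si > sj:
            wins[i] += 1
        elif sj > si:
            wins[j] += 1
        else:  # tie: split the point (an assumption of this sketch)
            wins[i] += 0.5
            wins[j] += 0.5
    return wins


def auc(scores, ys):
    """Fraction of positive-negative pairs ranked correctly (ties count half)."""
    pairs = [(sp, sn) for sp, yp in zip(scores, ys) if yp == 1
             for sn, yn in zip(scores, ys) if yn == 0]
    return sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp, sn in pairs) / len(pairs)


# Toy data with at least three examples per class, so each fold keeps both classes.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]
wins = tlpo_wins(xs, ys)
print(wins, auc(wins, ys))
```

Because the win counts give every example a single score, they can be thresholded to plot an ROC curve, which plain leave-pair-out does not allow.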