 Accuracy


Optirank: classification for RNA-Seq data with optimal ranking reference genes

arXiv.org Artificial Intelligence

Classification algorithms using RNA-Sequencing (RNA-Seq) data as input are used in a variety of biological applications. By nature, RNA-Seq data is subject to uncontrolled fluctuations both within and especially across datasets, which presents a major difficulty for a trained classifier to generalize to an external dataset. Replacing raw gene counts with the rank of gene counts inside an observation has proven effective to mitigate this problem. However, the rank of a feature is by definition relative to all other features, including highly variable features that introduce noise in the ranking. To address this problem and obtain more robust ranks, we propose a logistic regression model, optirank, which learns simultaneously the parameters of the model and the genes to use as a reference set in the ranking. We show the effectiveness of this method on simulated data. We also consider real classification tasks, which present different kinds of distribution shifts between train and test data. Those tasks concern a variety of applications, such as cancer of unknown primary classification, identification of specific gene signatures, and determination of cell type in single-cell RNA-Seq datasets. On those real tasks, optirank performs at least as well as the vanilla logistic regression on classical ranks, while producing sparser solutions. In addition, to increase the robustness against dataset shifts, we propose a multi-source learning scheme and demonstrate its effectiveness when used in combination with rank-based classifiers.
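The idea of ranking relative to a reference set can be sketched as follows. This is a minimal illustration, not the optirank model itself: the function name `reference_rank`, the toy counts, and the hand-picked reference set are all assumptions (in the abstract, the reference set is learned jointly with the classifier). Here the rank of a gene is taken to be the number of reference genes with a strictly smaller count.

```python
from bisect import bisect_left


def reference_rank(counts, reference):
    """Rank each gene of one observation against a reference subset.

    counts: gene counts for a single observation.
    reference: indices of the genes in the reference set (hand-picked here;
    hypothetical stand-in for a learned set).
    Returns, per gene, how many reference genes have a strictly smaller count.
    """
    ref_values = sorted(counts[i] for i in reference)
    # bisect_left gives the number of sorted reference values strictly below c.
    return [bisect_left(ref_values, c) for c in counts]


# Toy observation: gene 4 is highly variable across datasets.
counts = [5, 100, 20, 7, 3000]
shifted = [5, 100, 20, 7, 1]  # same sample, gene 4 fluctuates

stable_reference = [0, 1, 2, 3]  # excludes the noisy gene
all_genes = range(len(counts))   # classical ranking uses every gene

# Ranks of genes 0..3 against the stable reference do not move.
print(reference_rank(counts, stable_reference)[:4])
print(reference_rank(shifted, stable_reference)[:4])

# Classical ranks of the same genes shift when gene 4 fluctuates.
print(reference_rank(counts, all_genes)[:4])
print(reference_rank(shifted, all_genes)[:4])
```

With the noisy gene excluded from the reference set, the ranks of the remaining genes are unaffected by its fluctuation, which is the robustness property the abstract motivates.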


How far have we come with Zero Shot Learning part 1 (Machine Learning + NLP)

#artificialintelligence

Abstract: Identifying suicidality including suicidal ideation, attempts, and risk factors in electronic health record data in clinical notes is difficult. A major difficulty is the lack of training samples given the small number of true positive instances among the increasingly large number of patients being screened. This paper describes a novel methodology that identifies suicidality in clinical notes by addressing this data sparsity issue through zero-shot learning. U.S. Veterans Affairs clinical notes served as data. The training dataset label was determined using diagnostic codes of suicide attempt and self-harm.


User-Centered Security in Natural Language Processing

arXiv.org Artificial Intelligence

This dissertation proposes a framework of user-centered security in Natural Language Processing (NLP), and demonstrates how it can improve the accessibility of related research. Accordingly, it focuses on two security domains within NLP with great public interest. First, that of author profiling, which can be employed to compromise online privacy through invasive inferences. Without access and detailed insight into these models' predictions, there is no reasonable heuristic by which Internet users might defend themselves from such inferences. Secondly, that of cyberbullying detection, which by default presupposes a centralized implementation; i.e., content moderation across social platforms. As access to appropriate data is restricted, and the nature of the task rapidly evolves (both through lexical variation, and cultural shifts), the effectiveness of its classifiers is greatly diminished and thereby often misrepresented. Under the proposed framework, we predominantly investigate the use of adversarial attacks on language; i.e., changing a given input (generating adversarial samples) such that a given model does not function as intended. These attacks form a common thread between our user-centered security problems; they are highly relevant for privacy-preserving obfuscation methods against author profiling, and adversarial samples might also prove useful to assess the influence of lexical variation and augmentation on cyberbullying detection.


Federated PCA on Grassmann Manifold for Anomaly Detection in IoT Networks

arXiv.org Artificial Intelligence

In the era of the Internet of Things (IoT), network-wide anomaly detection is a crucial part of monitoring IoT networks due to the inherent security vulnerabilities of most IoT devices. Principal Components Analysis (PCA) has been proposed to separate network traffic into two disjoint subspaces corresponding to normal and malicious behaviors for anomaly detection. However, privacy concerns and the limitations of devices' computing resources compromise the practical effectiveness of PCA. We propose a federated PCA-based Grassmannian optimization framework that coordinates IoT devices to aggregate a joint profile of normal network behaviors for anomaly detection. First, we introduce a privacy-preserving federated PCA framework to simultaneously capture the profile of various IoT devices' traffic. Then, we investigate the alternating direction method of multipliers gradient-based learning on the Grassmann manifold to guarantee fast training and the absence of detection latency using limited computational resources. Empirical results on the NSL-KDD dataset demonstrate that our method outperforms baseline approaches. Finally, we show that the Grassmann manifold algorithm is highly adapted for IoT anomaly detection, which permits drastically reducing the analysis time of the system. To the best of our knowledge, this is the first federated PCA algorithm for anomaly detection meeting the requirements of IoT networks.
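The subspace idea behind PCA-based anomaly detection can be sketched in a few lines. This is only the plain, centralized version with a single principal component found by power iteration. It is not the federated or Grassmannian algorithm described in the abstract, and the function names and toy traffic features are assumptions made for illustration. A point is scored by its distance from the learned "normal" subspace.

```python
import math


def fit_normal_profile(rows, iters=200):
    """Return (mean, v): feature means and the top principal direction.

    Uses power iteration on X^T X of the centered data. Assumes the data are
    not degenerate and the start vector is not orthogonal to the top direction.
    """
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        proj = [sum(r[j] * v[j] for j in range(d)) for r in centered]
        w = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v


def residual_score(x, mean, v):
    """Distance from x to the one-dimensional normal subspace (anomaly score)."""
    c = [xi - mi for xi, mi in zip(x, mean)]
    p = sum(ci * vi for ci, vi in zip(c, v))
    return math.sqrt(sum((ci - p * vi) ** 2 for ci, vi in zip(c, v)))


# Toy "normal" traffic: the second feature is roughly twice the first.
normal = [[t, 2 * t + 0.1 * (-1) ** t] for t in range(10)]
mean, v = fit_normal_profile(normal)

print(residual_score([8, 16], mean, v))  # follows the normal pattern: small score
print(residual_score([8, 0], mean, v))   # breaks the pattern: large score
```

In practice a threshold on the score, chosen from held-out normal traffic, would separate normal from anomalous points.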


Reconstructing Sparse Multiplex Networks with Application to Covert Networks

arXiv.org Artificial Intelligence

Network structure provides critical information for understanding the dynamic behavior of networks. However, the complete structure of real-world networks is often unavailable, so it is crucially important to develop approaches to infer a more complete structure of networks. In this paper, we integrate the configuration model for generating random networks into an Expectation-Maximization-Aggregation (EMA) framework to reconstruct the complete structure of multiplex networks. We validate the proposed EMA framework against the random model on several real-world multiplex networks, including both covert and overt ones. We find that the EMA framework generally achieves the best predictive accuracy compared to the EM framework and the random model. As the number of layers increases, the performance improvement of EMA over EM decreases. The inferred multiplex networks can be leveraged to inform decision-making on monitoring covert networks as well as allocating limited resources for collecting additional information to improve reconstruction accuracy. For law enforcement agencies, the inferred complete network structure can be used to develop more effective strategies for covert network interdiction.


Causal Discovery from Sparse Time-Series Data Using Echo State Network

arXiv.org Artificial Intelligence

Causal discovery between collections of time-series data can help diagnose causes of symptoms and hopefully prevent faults before they occur. However, reliable causal discovery can be very challenging, especially when the data acquisition rate varies (i.e., non-uniform data sampling), or in the presence of missing data points (e.g., sparse data sampling). To address these issues, we propose a new system comprising two parts: the first fills in missing data with Gaussian Process Regression, and the second leverages an Echo State Network, a type of reservoir computer (i.e., used for chaotic system modelling), for causal discovery. We evaluate the performance of our proposed system against three other off-the-shelf causal discovery algorithms, namely, structural expectation-maximization, sub-sampled linear auto-regression absolute coefficients, and multivariate Granger Causality with vector auto-regression, using the Tennessee Eastman chemical dataset. We report their corresponding Matthews Correlation Coefficient (MCC) and Receiver Operating Characteristic (ROC) curves and show that the proposed system outperforms existing algorithms, demonstrating the viability of our approach to discover causal relationships in a complex system with missing entries.
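For readers unfamiliar with the Matthews Correlation Coefficient, the sketch below shows how it could be computed when scoring a recovered causal graph against a known one. The flattened edge lists are made-up toy data, and the function name `mcc` and the convention of returning 0.0 when the denominator vanishes are assumptions of this sketch, not details taken from the paper.

```python
import math


def mcc(true_edges, pred_edges):
    """Matthews Correlation Coefficient for two equal-length 0/1 sequences.

    Each position is one candidate causal edge: 1 = edge present, 0 = absent.
    Returns 0.0 when the denominator is zero (a common convention).
    """
    tp = fp = tn = fn = 0
    for t, p in zip(true_edges, pred_edges):
        if t and p:
            tp += 1
        elif not t and p:
            fp += 1
        elif t and not p:
            fn += 1
        else:
            tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy example: six candidate edges, flattened from an adjacency matrix.
true_edges = [1, 1, 0, 0, 1, 0]
pred_edges = [1, 0, 0, 0, 1, 1]

print(mcc(true_edges, pred_edges))  # partially correct recovery
print(mcc(true_edges, true_edges))  # perfect recovery gives 1.0
```

Unlike plain accuracy, MCC stays informative when true edges are rare, which is typical of sparse causal graphs.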


Value Cards: An Educational Toolkit for Teaching Social Impacts of Machine Learning through Deliberation

arXiv.org Artificial Intelligence

Recently, there have been increasing calls for computer science curricula to complement existing technical training with topics related to Fairness, Accountability, Transparency, and Ethics. In this paper, we present Value Cards, an educational toolkit to inform students and practitioners of the social impacts of different machine learning models via deliberation. This paper presents an early use of our approach in a college-level computer science course. Through an in-class activity, we report empirical data for the initial effectiveness of our approach. Our results suggest that the use of the Value Cards toolkit can improve students' understanding of both the technical definitions and trade-offs of performance metrics and their ability to apply them in real-world contexts, help them recognize the significance of considering diverse social values in the development and deployment of algorithmic systems, and enable them to communicate, negotiate and synthesize the perspectives of diverse stakeholders. Our study also demonstrates a number of caveats we need to consider when using the different variants of the Value Cards toolkit. Finally, we discuss the challenges as well as future applications of our approach.


FOLD-SE: An Efficient Rule-based Machine Learning Algorithm with Scalable Explainability

arXiv.org Artificial Intelligence

We present FOLD-SE, an efficient, explainable machine learning algorithm for classification tasks given tabular data containing numerical and categorical values. FOLD-SE generates a set of default rules, essentially a stratified normal logic program, as an (explainable) trained model. Explainability provided by FOLD-SE is scalable, meaning that regardless of the size of the dataset, the number of learned rules and learned literals stays quite small while good accuracy in classification is maintained. A model with a smaller number of rules and literals is easier for human beings to understand. FOLD-SE is competitive with state-of-the-art machine learning algorithms such as XGBoost and Multi-Layer Perceptrons (MLP) with respect to accuracy of prediction. However, unlike XGBoost and MLP, the FOLD-SE algorithm is explainable. The FOLD-SE algorithm builds upon our earlier work on developing the explainable FOLD-R++ machine learning algorithm for binary classification and inherits all of its positive features. Thus, pre-processing of the dataset, using techniques such as one-hot encoding, is not needed. Like FOLD-R++, FOLD-SE uses prefix sums to speed up computations, resulting in FOLD-SE being an order of magnitude faster than XGBoost and MLP in execution speed. The FOLD-SE algorithm outperforms FOLD-R++ as well as other rule-learning algorithms such as RIPPER in efficiency, performance and scalability, especially for large datasets. A major reason for the scalable explainability of FOLD-SE is the use of a literal selection heuristic based on Gini Impurity, as opposed to the Information Gain used in FOLD-R++. A multi-category classification version of FOLD-SE is also presented.
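To make the two ingredients named above concrete, the sketch below picks a threshold for one numeric feature by minimizing weighted Gini impurity, using a prefix sum of positive labels so every candidate split is scored in constant time after sorting. This is a generic illustration under assumed names (`gini`, `best_threshold`) and toy data; it is not the FOLD-SE literal selection heuristic itself, whose details are not given in the abstract.

```python
def gini(pos, neg):
    """Gini impurity of a binary group with `pos` positives and `neg` negatives."""
    total = pos + neg
    if total == 0:
        return 0.0
    p = pos / total
    return 2 * p * (1 - p)


def best_threshold(values, labels):
    """Return (threshold, score) minimizing weighted Gini for `value <= threshold`.

    labels are 0/1. Returns (None, inf) if all values are identical.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    # prefix_pos[i] = number of positive labels among the first i sorted items.
    prefix_pos = [0]
    for _, label in pairs:
        prefix_pos.append(prefix_pos[-1] + label)
    total_pos = prefix_pos[-1]

    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal values
        left_pos = prefix_pos[i]
        left_neg = i - left_pos
        right_pos = total_pos - left_pos
        right_neg = (n - i) - right_pos
        score = (i * gini(left_pos, left_neg)
                 + (n - i) * gini(right_pos, right_neg)) / n
        if score < best[1]:
            best = (pairs[i - 1][0], score)
    return best


# Toy feature that separates the classes cleanly at 3.
print(best_threshold([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))  # (3, 0.0)
```

The prefix sum avoids recounting labels for each candidate threshold, so the scan after sorting is linear in the number of examples.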


How AI/ML Can Thwart DDoS Attacks

#artificialintelligence

After early excitement about artificial intelligence (AI) in the late 1980s and early 1990s, followed by a couple of "AI winters" -- periods of reduced funding, interest and even disillusionment -- we now again see great enthusiasm about all things related to AI and machine learning (ML). It is no wonder that AI/ML is also being considered for network security, including distributed denial-of-service (DDoS) protection. It's not that AI/ML algorithms have changed so radically -- but they have matured. In network security, like in many other fields, the abundance of data and greater-than-ever processing power make it feasible to implement new AI/ML algorithms in silicon or in the cloud, allowing us to teach machines to be more accurate and faster than humans are. With DDoS security, the problem is distinguishing "good" from "bad" traffic and minimizing the mitigative actions to reduce the effect on "good" traffic.


Upward lightning at wind turbines: Risk assessment from larger-scale meteorology

arXiv.org Artificial Intelligence

Upward lightning (UL) has become an increasingly important threat to wind turbines as ever more of them are being installed for renewably producing electricity. The taller the wind turbine, the higher the risk that the type of lightning striking the man-made structure is UL. UL can be much more destructive than downward lightning due to its long-lasting initial continuous current leading to a large charge transfer within the lightning discharge process. Current standards for the risk assessment of lightning at wind turbines mainly take the summer lightning activity into account, which is inferred from lightning location systems (LLS). Ground truth lightning current measurements reveal that less than 50% of UL might be detected by LLS. This leads to a large underestimation of the proportion of LLS-non-detectable UL at wind turbines, which is the dominant lightning type in the cold season. This study aims to assess the risk of LLS-detectable and LLS-non-detectable UL at wind turbines using direct UL measurements at the Gaisberg Tower (Austria) and Säntis Tower (Switzerland). Direct UL observations are linked to meteorological reanalysis data and joined by random forests, a powerful machine learning technique. The meteorological drivers for the non-/occurrence of LLS-detectable and LLS-non-detectable UL, respectively, are found from the random forest models trained at the towers and have large predictive skill on independent data. In a second step, the results from the tower-trained models are extended to a larger study domain (Central and Northern Germany). The tower-trained model for LLS-detectable lightning is independently verified at wind turbine locations in that domain and found to reliably diagnose that type of UL. Risk maps based on case study events show that high diagnosed probabilities in the study domain coincide with actual UL events.