
 Accuracy


Explainable Intrusion Detection Systems (X-IDS): A Survey of Current Methods, Challenges, and Opportunities

arXiv.org Artificial Intelligence

The application of Artificial Intelligence (AI) and Machine Learning (ML) to cybersecurity challenges has gained traction in industry and academia, partially as a result of widespread malware attacks on critical systems such as cloud infrastructures and government institutions. Intrusion Detection Systems (IDS), using some forms of AI, have received widespread adoption due to their ability to handle vast amounts of data with high prediction accuracy. These systems are hosted in the organizational Cyber Security Operation Center (CSoC) as a defense tool to monitor and detect malicious network flow that would otherwise impact the Confidentiality, Integrity, and Availability (CIA). CSoC analysts rely on these systems to make decisions about the detected threats. However, IDSs designed using Deep Learning (DL) techniques are often treated as black box models and do not provide a justification for their predictions. This creates a barrier for CSoC analysts, as they are unable to improve their decisions based on the model's predictions. One solution to this problem is to design explainable IDS (X-IDS). This survey reviews the state of the art in explainable AI (XAI) for IDS and its current challenges, and discusses how these challenges extend to the design of an X-IDS. In particular, we discuss black box and white box approaches comprehensively. We also present the tradeoff between these approaches in terms of their performance and ability to produce explanations. Furthermore, we propose a generic human-in-the-loop architecture that can be used as a guideline when designing an X-IDS. Research recommendations are given from three critical viewpoints: the need to define explainability for IDS, the need to create explanations tailored to various stakeholders, and the need to design metrics to evaluate explanations.


Towards A Holistic View of Bias in Machine Learning: Bridging Algorithmic Fairness and Imbalanced Learning

arXiv.org Artificial Intelligence

Machine learning (ML) is playing an increasingly important role in rendering decisions that affect a broad range of groups in society. ML models inform decisions in criminal justice, the extension of credit in banking, and the hiring practices of corporations. This posits the requirement of model fairness, which holds that automated decisions should be equitable with respect to protected features (e.g., gender, race, or age) that are often under-represented in the data. We postulate that this problem of under-representation has a corollary to the problem of imbalanced data learning. This class imbalance is often reflected in both classes and protected features. For example, one class (those receiving credit) may be over-represented with respect to another class (those not receiving credit) and a particular group (females) may be under-represented with respect to another group (males). A key element in achieving algorithmic fairness with respect to protected groups is the simultaneous reduction of class and protected group imbalance in the underlying training data, which facilitates increases in both model accuracy and fairness. We discuss the importance of bridging imbalanced learning and group fairness by showing how key concepts in these fields overlap and complement each other; and propose a novel oversampling algorithm, Fair Oversampling, that addresses both skewed class distributions and protected features. Our method: (i) can be used as an efficient pre-processing algorithm for standard ML algorithms to jointly address imbalance and group equity; and (ii) can be combined with fairness-aware learning algorithms to improve their robustness to varying levels of class imbalance. Additionally, we take a step toward bridging the gap between fairness and imbalanced learning with a new metric, Fair Utility, that combines balanced accuracy with fairness.
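To make the idea of jointly reducing class and protected-group imbalance concrete, here is a minimal sketch that duplicates rows at random until every (class, group) cell is as large as the largest one. This is plain random oversampling, not the paper's Fair Oversampling algorithm, whose details are not given in the abstract. The function name `oversample_cells`, the column names `approved` and `sex`, and the toy rows are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def oversample_cells(rows, label_key, group_key, seed=0):
    """Randomly duplicate rows so every (label, group) cell matches the largest cell."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for row in rows:
        cells[(row[label_key], row[group_key])].append(row)
    target = max(len(members) for members in cells.values())
    balanced = []
    for members in cells.values():
        balanced.extend(members)
        # Draw with replacement from the existing rows of this cell only.
        balanced.extend(rng.choice(members) for _ in range(target - len(members)))
    return balanced


# Hypothetical toy data: credit approval (class) and sex (protected feature).
rows = (
    [{"approved": 1, "sex": "male"}] * 6
    + [{"approved": 1, "sex": "female"}] * 2
    + [{"approved": 0, "sex": "male"}] * 3
    + [{"approved": 0, "sex": "female"}] * 1
)
balanced = oversample_cells(rows, "approved", "sex")
print(Counter((r["approved"], r["sex"]) for r in balanced))
```

Duplicating rows keeps the sketch short; methods that synthesize new samples would replace the `rng.choice` step.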


ACLNet: An Attention and Clustering-based Cloud Segmentation Network

arXiv.org Artificial Intelligence

We propose a novel deep learning model named ACLNet for cloud segmentation from ground images. ACLNet uses both a deep neural network and a machine learning (ML) algorithm to extract complementary features. Specifically, it uses EfficientNet-B0 as the backbone, "à trous spatial pyramid pooling" (ASPP) to learn at multiple receptive fields, and a "global attention module" (GAM) to extract fine-grained details from the image. ACLNet also uses k-means clustering to extract cloud boundaries more precisely. ACLNet is effective for both daytime and nighttime images. It provides a lower error rate, higher recall, and higher F1-score than state-of-the-art cloud segmentation models. The source code of ACLNet is available here: https://github.com/ckmvigil/ACLNet.


Orthogonal-Coding-Based Feature Generation for Transductive Open-Set Recognition via Dual-Space Consistent Sampling

arXiv.org Artificial Intelligence

Open-set recognition (OSR) aims to simultaneously detect unknown-class samples and classify known-class samples. Most of the existing OSR methods are inductive methods, which generally suffer from the domain shift problem that the learned model from the known-class domain might be unsuitable for the unknown-class domain. Addressing this problem, inspired by the success of transductive learning for alleviating the domain shift problem in many other visual tasks, we propose an Iterative Transductive OSR framework, called IT-OSR, which implements three explored modules iteratively, including a reliability sampling module, a feature generation module, and a baseline update module. Specifically, at each iteration, a dual-space consistent sampling approach is presented in the explored reliability sampling module for selecting some relatively more reliable ones from the test samples according to their pseudo labels assigned by a baseline method, which could be an arbitrary inductive OSR method. Then, a conditional dual-adversarial generative network under an orthogonal coding condition is designed in the feature generation module to generate discriminative sample features of both known and unknown classes according to the selected test samples with their pseudo labels. Finally, the baseline method is updated for sample re-prediction in the baseline update module by jointly utilizing the generated features, the selected test samples with pseudo labels, and the training samples. Extensive experimental results on both the standard-dataset and the cross-dataset settings demonstrate that the derived transductive methods, by introducing two typical inductive OSR methods into the proposed IT-OSR framework, achieve better performances than 15 state-of-the-art methods in most cases.


PhishSim: Aiding Phishing Website Detection with a Feature-Free Tool

arXiv.org Artificial Intelligence

In this paper, we propose a feature-free method for detecting phishing websites using the Normalized Compression Distance (NCD), a parameter-free similarity measure which computes the similarity of two websites by compressing them, thus eliminating the need to perform any feature extraction. It also removes any dependence on a specific set of website features. This method examines the HTML of webpages and computes their similarity with known phishing websites, in order to classify them. We use the Furthest Point First algorithm to perform phishing prototype extractions, in order to select instances that are representative of a cluster of phishing webpages. We also introduce the use of an incremental learning algorithm as a framework for continuous and adaptive detection without extracting new features when concept drift occurs. On a large dataset, our proposed method significantly outperforms previous methods in detecting phishing websites, with an AUC score of 98.68%, a high true positive rate (TPR) of around 90%, while maintaining a low false positive rate (FPR) of 0.58%. Our approach uses prototypes, eliminating the need to retain long term data in the future, and is feasible to deploy in real systems with a processing time of roughly 0.3 seconds.
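The Normalized Compression Distance is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length. The sketch below illustrates that formula and a nearest-prototype lookup. It assumes zlib as the compressor (the abstract does not name the one used), and the helper names and toy pages are hypothetical, not taken from the paper.

```python
import zlib
from typing import Mapping, Tuple


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (the compressor is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: small for similar inputs, close to 1 for unrelated ones."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def nearest_prototype(page: bytes, prototypes: Mapping[str, bytes]) -> Tuple[str, float]:
    """Return the name and distance of the prototype closest to `page`."""
    scored = ((name, ncd(page, html)) for name, html in prototypes.items())
    return min(scored, key=lambda pair: pair[1])


# Hypothetical toy pages, not data from the paper.
login_form = (
    b"<html><body><form action='/login'>"
    b"<input name='user'><input name='pass'></form></body></html>"
) * 10
near_copy = login_form.replace(b"/login", b"/signin")
unrelated = bytes(range(256)) * 4

prototypes = {"fake-login": login_form, "binary-blob": unrelated}
name, distance = nearest_prototype(near_copy, prototypes)
print(name, round(distance, 3))
```

A page could then be flagged when its distance to the nearest phishing prototype falls below a threshold; choosing that threshold is outside the scope of this sketch.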


Understanding Unfairness in Fraud Detection through Model and Data Bias Interactions

arXiv.org Artificial Intelligence

In recent years, machine learning algorithms have become ubiquitous in a multitude of high-stakes decision-making applications. The unparalleled ability of machine learning algorithms to learn patterns from data also enables them to incorporate biases embedded within. A biased model can then make decisions that disproportionately harm certain groups in society -- limiting their access to financial services, for example. The awareness of this problem has given rise to the field of Fair ML, which focuses on studying, measuring, and mitigating unfairness in algorithmic prediction, with respect to a set of protected groups (e.g., race or gender). However, the underlying causes for algorithmic unfairness still remain elusive, with researchers divided between blaming either the ML algorithms or the data they are trained on. In this work, we maintain that algorithmic unfairness stems from interactions between models and biases in the data, rather than from isolated contributions of either of them. To this end, we propose a taxonomy to characterize data bias and we study a set of hypotheses regarding the fairness-accuracy trade-offs that fairness-blind ML algorithms exhibit under different data bias settings. On our real-world account-opening fraud use case, we find that each setting entails specific trade-offs, affecting fairness in expected value and variance -- the latter often going unnoticed. Moreover, we show how algorithms compare differently in terms of accuracy and fairness, depending on the biases affecting the data. Finally, we note that under specific data bias conditions, simple pre-processing interventions can successfully balance group-wise error rates, while the same techniques fail in more complex settings.
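As an illustration of what comparing group-wise error rates involves, the following sketch computes the false positive rate per protected group and the gap between groups. The function name, the choice of false positive rate as the error rate, and the toy labels are assumptions made for illustration; they are not the paper's data or evaluation code.

```python
def group_false_positive_rates(y_true, y_pred, groups):
    """False positive rate per group: FP / (FP + TN), over label-0 instances only."""
    negatives, false_positives = {}, {}
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 0:
            negatives[group] = negatives.get(group, 0) + 1
            false_positives[group] = false_positives.get(group, 0) + (pred == 1)
    return {g: false_positives[g] / negatives[g] for g in negatives}


# Hypothetical toy labels: 1 = flagged as fraud, 0 = legitimate.
y_true = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0, 0, 1]
groups = ["A"] * 5 + ["B"] * 5

rates = group_false_positive_rates(y_true, y_pred, groups)
gap = max(rates.values()) - min(rates.values())
print(rates, gap)
```

In this toy example, legitimate applicants in group B are wrongly flagged twice as often as those in group A, which is the kind of disparity a pre-processing intervention would aim to reduce.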


Domain adaptation strategies for cancer-independent detection of lymph node metastases

arXiv.org Artificial Intelligence

Recently, large, high-quality public datasets have led to the development of convolutional neural networks that can detect lymph node metastases of breast cancer at the level of expert pathologists. Many cancers, regardless of the site of origin, can metastasize to lymph nodes. However, collecting and annotating high-volume, high-quality datasets for every cancer type is challenging. In this paper we investigate how to leverage existing high-quality datasets most efficiently in multi-task settings for closely related tasks. Specifically, we will explore different training and domain adaptation strategies, including prevention of catastrophic forgetting, for colon and head-and-neck cancer metastasis detection in lymph nodes. Our results show state-of-the-art performance on both cancer metastasis detection tasks. Furthermore, we show the effectiveness of repeated adaptation of networks from one cancer type to another to obtain multi-task metastasis detection networks. Last, we show that leveraging existing high-quality datasets can significantly boost performance on new target tasks and that catastrophic forgetting can be effectively mitigated using regularization.


Artificial intelligence in cardiovascular medicine

#artificialintelligence

Artificial intelligence (AI) is a rapidly evolving transdisciplinary field employing machine learning (ML) techniques, which aim to simulate human intuition to offer cost-effective and scalable solutions to better manage cardiovascular disease (CVD). ML algorithms are increasingly being developed and applied in various facets of cardiovascular medicine, including but not limited to heart failure, electrophysiology, valvular heart disease, and coronary artery disease. Within heart failure, AI algorithms can augment diagnostic capabilities and clinical decision-making through automated cardiac measurements. Occult cardiac disease is increasingly being identified using ML from diagnostic data. Improved diagnostic and prognostic capabilities using ML algorithms are enhancing clinical care of patients with valvular heart disease and coronary artery disease. The growth of AI techniques is not without inherent challenges, the most important of which is the need for greater external validation through multicenter, prospective clinical trials.


Confusion Matrices - Part 2

#artificialintelligence

This post picks up where the last one left off and covers building confusion matrices for multi-class classification problems. We load the Iris dataset, split it into training and test sets, build a K-Nearest Neighbors (k-NN) classifier that attempts to predict the class of Iris plant (setosa, versicolor, or virginica), and craft a confusion matrix using these predictions. We then describe some additional metrics, including the macro and micro precision, and discuss sklearn's classification_report, covering the $F_1$ metric and delving slightly deeper into the $F_{0.5}$ metric. In the end, we discuss the classification_report for the confusion matrix we built on the Iris dataset. Let's import the needed libraries and set the matplotlib and seaborn settings.
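As a library-free sketch of the multi-class metrics this post describes, the code below builds a confusion matrix and derives macro precision, micro precision, and $F_\beta$ from it. It does not use sklearn or the real Iris data: the label lists are hypothetical predictions chosen to keep the arithmetic easy to follow, and the function names are illustrative.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def per_class_precision(matrix):
    """Diagonal entry divided by its column sum (everything predicted as that class)."""
    size = len(matrix)
    column_sums = [sum(matrix[i][j] for i in range(size)) for j in range(size)]
    return [matrix[j][j] / column_sums[j] if column_sums[j] else 0.0 for j in range(size)]


def macro_precision(matrix):
    """Unweighted mean of the per-class precisions."""
    precisions = per_class_precision(matrix)
    return sum(precisions) / len(precisions)


def micro_precision(matrix):
    """All correct predictions divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


def f_beta(precision, recall, beta):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    denominator = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denominator if denominator else 0.0


# Hypothetical predictions, not the output of a trained k-NN model.
labels = ["setosa", "versicolor", "virginica"]
y_true = ["setosa"] * 2 + ["versicolor"] * 4 + ["virginica"] * 3
y_pred = ["setosa", "setosa",
          "versicolor", "versicolor", "versicolor", "virginica",
          "virginica", "virginica", "versicolor"]

matrix = confusion_matrix(y_true, y_pred, labels)
print(matrix)
print(macro_precision(matrix), micro_precision(matrix))
```

Macro precision averages the three per-class values equally, while micro precision pools all predictions, so the two differ whenever the classes are unevenly sized.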


DonorsChoose.org Application Screening

#artificialintelligence

DonorsChoose.org is a United States-based nonprofit organization that allows individuals to donate directly to public school classroom projects. At any given time, there are thousands of classroom requests that can be brought to life with a gift of any amount. Right now, a large number of volunteers is needed to manually screen each submission before it's approved to be posted on DonorsChoose.org. The goal of the competition is to predict whether or not a DonorsChoose.org application will be approved. The competition dataset contains information from teachers' project applications to DonorsChoose.org.