Performance Analysis
Human-Expert-Level Brain Tumor Detection Using Deep Learning with Data Distillation and Augmentation
Lu, Diyuan, Polomac, Nenad, Gacheva, Iskra, Hattingen, Elke, Triesch, Jochen
The application of Deep Learning (DL) for medical diagnosis is often hampered by two problems. First, the amount of training data may be scarce, as it is limited by the number of patients who have acquired the condition to be diagnosed. Second, the training data may be corrupted by various types of noise. Here, we study the problem of brain tumor detection from magnetic resonance spectroscopy (MRS) data, where both types of problems are prominent. To overcome these challenges, we propose a new method for training a deep neural network that distills particularly representative training examples and augments the training data by mixing these samples from one class with those from the same and other classes to create additional training samples. We demonstrate that this technique substantially improves performance, allowing our method to reach human-expert-level accuracy with just a few thousand training examples. Interestingly, the network learns to rely on features of the data that are usually ignored by human experts, suggesting new directions for future research.
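The abstract does not spell out the mixing rule, so the following is only a minimal sketch of the general idea, assuming a simple convex combination in which the distilled sample dominates and the new sample keeps its label. The function names, the `min_weight` parameter, and the label rule are all hypothetical and are not taken from the paper.

```python
import random


def mix(anchor, partner, weight):
    """Convex combination of two equal-length spectra (assumed mixing rule)."""
    if len(anchor) != len(partner):
        raise ValueError("spectra must have the same length")
    return [weight * a + (1.0 - weight) * p for a, p in zip(anchor, partner)]


def augment(distilled, pool, n_new, min_weight=0.6, seed=0):
    """Create n_new mixed samples from (spectrum, label) pairs.

    Each new sample mixes a distilled anchor with a partner drawn from the
    whole pool (same or other class) and keeps the anchor's label, which is
    an assumption of this sketch.
    """
    rng = random.Random(seed)
    new_samples = []
    for _ in range(n_new):
        anchor, label = rng.choice(distilled)
        partner, _ = rng.choice(pool)
        weight = rng.uniform(min_weight, 1.0)  # the anchor dominates the mix
        new_samples.append((mix(anchor, partner, weight), label))
    return new_samples
```

Keeping the anchor weight above 0.5 is one way to make it plausible that the anchor's label still describes the mixed sample; the paper may handle this differently.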
Modelling Credit Card Fraud Detection
Credit card fraud is a still-growing problem worldwide. Fraud losses were estimated at more than US$27 billion in 2018 and are projected to keep growing significantly over the next few years, as this article shows. As more and more people use credit cards in their daily routine, criminals' interest in opportunities to profit from them has also increased. The development of new technologies puts both criminals and credit card companies in a constant race to improve their systems and techniques. With that amount of money at stake, Machine Learning is surely not a new term for credit card companies, which were investing in it long before it became a trend, to create and optimize risk and fraud management models.
AI Learns from Lung CT Scans to Diagnose COVID-19
Although the initial wave of the SARS-CoV-2 pandemic has abated in many countries, healthcare providers are still looking to identify as many COVID-19 patients as possible and contain the disease. Fast and accurate diagnosis is especially important when unsuspecting patients with a coronavirus infection come to the hospital with health complaints but don't yet show symptoms of COVID-19. Nasal swab samples analyzed by RT-PCR are currently recommended for the diagnosis of COVID-19; however, supply shortages, a wait time of up to two days for results, and a false negative rate as high as 1 in 5 mean that alternative, large-scale COVID-19 screening tools are still being sought. SARS-CoV-2 is known to damage lung tissue in a distinct way that doctors are now seeking to exploit for new diagnostic approaches. Many COVID-19 patients develop pneumonia, which can progress to respiratory failure and sometimes death.
Prediction of Cancer Microarray and DNA Methylation Data using Non-negative Matrix Factorization
Patel, Parth, Passi, Kalpdrum, Jain, Chakresh Kumar
Over the past few years, microarray technology has spread considerably in the study of many biological patterns, particularly those pertaining to cancers such as leukemia, prostate cancer, and colon cancer. The primary bottleneck in properly understanding such datasets lies in their dimensionality, so studying them efficiently and effectively requires a substantial reduction in their dimension. This study suggests different algorithms and approaches for reducing the dimensionality of such microarray datasets. It exploits the matrix-like structure of microarray data and uses a popular technique called Non-Negative Matrix Factorization (NMF) to reduce the dimensionality, primarily in the field of biological data. Classification accuracies are then compared for these algorithms. This technique gives an accuracy of 98%.
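To illustrate how NMF reduces dimensionality, here is a minimal sketch that factorizes a non-negative matrix V (samples by features) into W (samples by rank) and H (rank by features) using the standard multiplicative update rules for the squared-error objective. The abstract does not say which NMF variant the authors used, so this choice, the function names, and the parameters are assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factorize non-negative V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random initialization keeps all updates well defined.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h
```

In a microarray setting, each row of W would serve as the low-dimensional representation of one sample and could be passed to a classifier in place of the original expression values.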
Revisiting Data Complexity Metrics Based on Morphology for Overlap and Imbalance: Snapshot, New Overlap Number of Balls Metrics and Singular Problems Prospect
Pascual-Triana, José Daniel, Charte, David, Arroyo, Marta Andrés, Fernández, Alberto, Herrera, Francisco
Data Science and Machine Learning have become fundamental assets for companies and research institutions alike. As one of their fields, supervised classification allows for class prediction of new samples, learning from given training data. However, some properties can cause datasets to be problematic to classify. In order to evaluate a dataset a priori, data complexity metrics have been used extensively. They provide information regarding different intrinsic characteristics of the data, which serve to evaluate classifier compatibility and a course of action that improves performance. However, most complexity metrics focus on just one characteristic of the data, which can be insufficient to properly evaluate the dataset with respect to the classifiers' performance. In fact, class overlap, a very detrimental feature for the classification process (especially when imbalance among class labels is also present), is hard to assess. This research work focuses on revisiting complexity metrics based on data morphology. In accordance with their nature, the premise is that they provide both good estimates for class overlap and strong correlations with the classification performance. For that purpose, a novel family of metrics has been developed. Being based on ball coverage by classes, they are named after Overlap Number of Balls. Finally, some prospects for the adaptation of this family of metrics to singular (more complex) problems are discussed.
Traceable raises $20 million for AI system that shields cloud app APIs from cyberattacks
Traceable, a startup developing an end-to-end cloud app security solution, today emerged from stealth with $20 million in venture equity financing. Newly flush with capital, CEO Jyoti Bansal intends to focus on acquiring customers globally while growing Traceable's team and accelerating R&D. Cloud-native apps are often built with hundreds or even thousands of API microservices (i.e., loosely coupled services), making them difficult to protect at scale. Gartner predicts that by 2022, API abuses will be the most frequent attack vector, which isn't surprising considering API calls represented 83% of web traffic as of 2018. Traceable ostensibly protects these APIs with machine learning algorithms that analyze app activity from the user and the session all the way down to the code.
Misclassification cost-sensitive ensemble learning: A unifying framework
Petrides, George, Verbeke, Wouter
The task of supervised machine learning is, given a set of recorded observations and their outcomes, to predict the outcome of new observations. Standard classification techniques aim for the highest overall accuracy or, equivalently, for the smallest total error, and include among others support vector machines, Bayesian classifiers, logistic regression, decision tree classifiers such as CART [6] and C4.5 [38], and ensemble methods which build several classifiers and aggregate their predictions such as Bagging [4], AdaBoost [16] and Random Forests [5]. Of particular interest in certain domains are binary classifiers which deal with cases where only two classes of outcomes are considered, such as fraudulent and legitimate credit card transactions, responders and non-responders to a marketing campaign, patients with and without cancer, intrusive and authorised network access, and defaulting and repaying debtors, to name a few. In most of these cases, one of the classes is a small minority and consequently traditional classifiers might classify all of its members as belonging to the majority class without any significant overall accuracy loss. The severity of this class imbalance becomes more noticeable when failing to correctly predict a minority class member is more costly than doing so with a member of the majority class, as is often the case. A remedy to the undesirable situation just described is provided by classifiers which, instead of accuracy, take misclassification costs into account and are thus termed cost-sensitive. We illustrate this idea in the credit card fraud detection framework: accepting a fraudulent transaction as legitimate incurs a cost equal to its amount.
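The fraud example can be made concrete with a minimal sketch of a cost-sensitive decision rule. It assumes that a missed fraud costs the transaction amount, as stated above, and that wrongly flagging a legitimate transaction costs a fixed fee. The fee (`admin_cost`), its default value, and the function names are hypothetical and are not taken from the paper.

```python
def flag_as_fraud(p_fraud, amount, admin_cost=5.0):
    """Return True when flagging has the lower expected cost.

    p_fraud is the classifier's estimated probability of fraud.
    """
    cost_if_accepted = p_fraud * amount             # missed fraud costs its amount
    cost_if_flagged = (1.0 - p_fraud) * admin_cost  # false alarm costs a fixed fee (assumed)
    return cost_if_flagged < cost_if_accepted


def total_cost(labels, decisions, amounts, admin_cost=5.0):
    """Sum the misclassification costs over a batch of transactions.

    labels[i] is truthy for a fraudulent transaction; decisions[i] is truthy
    when the transaction was flagged. Correct decisions cost nothing here,
    which is a simplification of this sketch.
    """
    total = 0.0
    for is_fraud, flagged, amount in zip(labels, decisions, amounts):
        if is_fraud and not flagged:
            total += amount        # accepted a fraudulent transaction
        elif flagged and not is_fraud:
            total += admin_cost    # blocked a legitimate transaction
    return total
```

Under this rule the same 10% fraud probability leads to different decisions for a 1,000-unit and a 10-unit transaction, which is exactly what an accuracy-driven threshold cannot express.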
ADSAGE: Anomaly Detection in Sequences of Attributed Graph Edges applied to insider threat detection at fine-grained level
Garchery, Mathieu, Granitzer, Michael
Previous works on the CERT insider threat detection case have neglected graph and text features despite their relevance to describing user behavior. Additionally, existing systems rely heavily on feature engineering and audit data aggregation to detect malicious activities. This is time-consuming, requires expert knowledge and prevents tracing back alerts to precise user actions. To address these issues we introduce ADSAGE to detect anomalies in audit log events modeled as graph edges. Our general method is the first to perform anomaly detection at edge level while supporting both edge sequences and attributes, which can be numeric, categorical or even text. We describe how ADSAGE can be used for fine-grained, event-level insider threat detection in different audit logs from the CERT use case. Noting that there is no standard benchmark for the CERT problem, we use a previously proposed evaluation setting based on realistic recall-based metrics. We evaluate ADSAGE on authentication, email traffic and web browsing logs from the CERT insider threat datasets, as well as on real-world authentication events. ADSAGE is effective at detecting anomalies in authentications, modeled as user to computer interactions, and in email communications. Simple baselines give surprisingly strong results as well. We also report performance split by malicious scenarios present in the CERT datasets: interestingly, several detectors are complementary and could be combined to improve detection. Overall, our results show that graph features are informative for characterizing malicious insider activities, and that detection at fine-grained level is possible.
Approximating the Ideal Observer for joint signal detection and localization tasks by use of supervised learning methods
Zhou, Weimin, Li, Hua, Anastasio, Mark A.
Medical imaging systems are commonly assessed and optimized by use of objective measures of image quality (IQ). The Ideal Observer (IO) performance has been advocated to provide a figure-of-merit for use in assessing and optimizing imaging systems because the IO sets an upper performance limit among all observers. When joint signal detection and localization tasks are considered, the IO that employs a modified generalized likelihood ratio test maximizes observer performance as characterized by the localization receiver operating characteristic (LROC) curve. Computations of likelihood ratios are analytically intractable in the majority of cases. Therefore, sampling-based methods that employ Markov-Chain Monte Carlo (MCMC) techniques have been developed to approximate the likelihood ratios. However, the applications of MCMC methods have been limited to relatively simple object models. Supervised learning-based methods that employ convolutional neural networks have been recently developed to approximate the IO for binary signal detection tasks. In this paper, the ability of supervised learning-based methods to approximate the IO for joint signal detection and localization tasks is explored. Both background-known-exactly and background-known-statistically signal detection and localization tasks are considered. The considered object models include a lumpy object model and a clustered lumpy model, and the considered measurement noise models include Laplacian noise, Gaussian noise, and mixed Poisson-Gaussian noise. The LROC curves produced by the supervised learning-based method are compared to those produced by the MCMC approach or analytical computation when feasible. The potential utility of the proposed method for computing objective measures of IQ for optimizing imaging system performance is explored.
Developing Machine Learning Pipelines
Even the most experienced Data Scientists are not always familiar with the best practices involved in developing a Machine Learning pipeline. There is a lot of confusion about what steps should be involved, what their sequence should be and, in general, how to ensure that the insights you create are accurate and valuable. There is also a very limited number of good resources describing a practical and correct approach. However, after many data science projects, you begin to realise that the approach to building a pipeline always remains the same. Machine Learning pipelines are modular, and, depending on the situation, some steps can be added or skipped.