Accuracy
AI improves breast cancer risk prediction
Most existing breast cancer screening programs are based on mammography at similar time intervals -- typically, annually or every two years -- for all women. This "one size fits all" approach is not optimized for cancer detection on an individual level and may hamper the effectiveness of screening programs. "Risk prediction is an important building block of an individually adapted screening policy," said study lead author Karin Dembrower, M.D., breast radiologist and Ph.D. candidate from the Karolinska Institute in Stockholm, Sweden. "Effective risk prediction can improve attendance and confidence in screening programs." High breast density, or a greater amount of glandular and connective tissue compared to fat, is considered a risk factor for cancer.
Improving drug response prediction by integrating multiple data sources: matrix factorization, kernel and network-based approaches
Note: MF Matrix factorization; BMF Bayesian matrix factorization; KBMF Kernel Bayesian matrix factorization; KRR Kernel ridge regression; NBR Network based regression; NBC Network based classification; CV Cross validation; LOOCV Leave-one-out cross validation; PCC Pearson correlation coefficient; RMSE Root mean square error; MSE Mean square error; SCC Spearman correlation coefficient; NDCG Normalized discounted cumulative gain; R2 Coefficient of determination; NRMSE Normalized root mean squared error; AUC Area under curve; PPI Proteinโprotein interaction.
EnsemFDet: An Ensemble Approach to Fraud Detection based on Bipartite Graph
Ren, Yuxiang, Zhu, Hao, ZHang, Jiawei, Dai, Peng, Bo, Liefeng
Fraud detection is extremely critical for e-commerce business. It is the intent of the companies to detect and prevent fraud as early as possible. Existing fraud detection methods try to identify unexpected dense subgraphs and treat related nodes as suspicious. Spectral relaxation-based methods solve the problem efficiently but hurt the performance due to the relaxed constraints. Besides, many methods cannot be accelerated with parallel computation or control the number of returned suspicious nodes because they provide a set of subgraphs with diverse node sizes. These drawbacks affect the real-world applications of existing methods. In this paper, we propose an Ensemble-based Fraud Detection (EnsemFDet) method to scale up fraud detection in bipartite graphs by decomposing the original problem into subproblems on small-sized subgraphs. By oversampling the graph and solving the subproblems, the ensemble approach further votes suspicious nodes without sacrificing the prediction accuracy. Extensive experiments have been done on real transaction data from JD.com, which is one of the world's largest e-commerce platforms. Experimental results demonstrate the effectiveness, practicability, and scalability of EnsemFDet. More specifically, EnsemFDet is up to 100x faster than the state-of-the-art methods due to its parallelism with all aspects of data.
Privacy Attacks on Network Embeddings
Ellers, Michael, Cochez, Michael, Schumacher, Tobias, Strohmaier, Markus, Lemmerich, Florian
Data ownership and data protection are increasingly important topics with ethical and legal implications, e.g., with the right to erasure established in the European General Data Protection Regulation (GDPR). In this light, we investigate network embeddings, i.e., the representation of network nodes as low-dimensional vectors. We consider a typical social network scenario with nodes representing users and edges relationships between them. We assume that a network embedding of the nodes has been trained. After that, a user demands the removal of his data, requiring the full deletion of the corresponding network information, in particular the corresponding node and incident edges. In that setting, we analyze whether after the removal of the node from the network and the deletion of the vector representation of the respective node in the embedding significant information about the link structure of the removed node is still encoded in the embedding vectors of the remaining nodes. This would require a (potentially computationally expensive) retraining of the embedding. For that purpose, we deploy an attack that leverages information from the remaining network and embedding to recover information about the neighbors of the removed node. The attack is based on (i) measuring distance changes in network embeddings and (ii) a machine learning classifier that is trained on networks that are constructed by removing additional nodes. Our experiments demonstrate that substantial information about the edges of a removed node/user can be retrieved across many different datasets. This implies that to fully protect the privacy of users, node deletion requires complete retraining - or at least a significant modification - of original network embeddings. Our results suggest that deleting the corresponding vector representation from network embeddings alone is not sufficient from a privacy perspective.
A Systematic Comparison of Bayesian Deep Learning Robustness in Diabetic Retinopathy Tasks
Filos, Angelos, Farquhar, Sebastian, Gomez, Aidan N., Rudner, Tim G. J., Kenton, Zachary, Smith, Lewis, Alizadeh, Milad, de Kroon, Arnoud, Gal, Yarin
Evaluation of Bayesian deep learning (BDL) methods is challenging. We often seek to evaluate the methods' robustness and scalability, assessing whether new tools give `better' uncertainty estimates than old ones. These evaluations are paramount for practitioners when choosing BDL tools on-top of which they build their applications. Current popular evaluations of BDL methods, such as the UCI experiments, are lacking: Methods that excel with these experiments often fail when used in application such as medical or automotive, suggesting a pertinent need for new benchmarks in the field. We propose a new BDL benchmark with a diverse set of tasks, inspired by a real-world medical imaging application on \emph{diabetic retinopathy diagnosis}. Visual inputs (512x512 RGB images of retinas) are considered, where model uncertainty is used for medical pre-screening---i.e. to refer patients to an expert when model diagnosis is uncertain. Methods are then ranked according to metrics derived from expert-domain to reflect real-world use of model uncertainty in automated diagnosis. We develop multiple tasks that fall under this application, including out-of-distribution detection and robustness to distribution shift. We then perform a systematic comparison of well-tuned BDL techniques on the various tasks. From our comparison we conclude that some current techniques which solve benchmarks such as UCI `overfit' their uncertainty to the dataset---when evaluated on our benchmark these underperform in comparison to simpler baselines. The code for the benchmark, its baselines, and a simple API for evaluating new BDL tools are made available at https://github.com/oatml/bdl-benchmarks.
AEGR: A simple approach to gradient reversal in autoencoders for network anomaly detection
Babaei, Kasra, Chen, Zhi Yuan, Maul, Tomas
--Anomaly detection is referred to as a process in which the aim is to detect data points that follow a different pattern from the majority of data points. Anomaly detection methods suffer from several well-known challenges that hinder their performance such as high dimensionality. Autoencoders are unsupervised neural networks that have been used for the purpose of reducing dimensionality and also detecting network anomalies in large datasets. The performance of autoen-coders debilitates when the training set contains noise and anomalies. In this paper, a new gradient-reversal method is proposed to overcome the influence of anomalies on the training phase for the purpose of detecting network anomalies. The method is different from other approaches as it does not require an anomaly-free training set and is based on reconstruction error . Once latent variables are extracted from the network, Local Outlier Factor is used to separate normal data points from anomalies. A simple pruning approach and data augmentation is also added to further improve performance. The experimental results show that the proposed model can outperform other well-know approaches. In many real-world problems such as detecting fraudulent activities or detecting failure in aircraft engines, there is a pressing need to identify observations that have a striking dissimilarity compared to the majority. In medicine for instance, this discovery can lead to early detection of lung cancer or breast cancer.
Evaluating the Effectiveness of Margin Parameter when Learning Knowledge Embedding Representation for Domain-specific Multi-relational Categorized Data
Chung, Matthew Wai Heng, Tissot, Hegler
Learning knowledge representation is an increasingly important technology that supports a variety of machine learning related applications. However, the choice of hyperparameters is seldom justified and usually relies on exhaustive search. Understanding the effect of hyperparameter combinations on embedding quality is crucial to avoid the inefficient process and enhance practicality of vector representation methods. We evaluate the effects of distinct values for the margin parameter focused on translational embedding representation models for multi-relational categorized data. We assess the margin influence regarding the quality of embedding models by contrasting traditional link prediction task accuracy against a classification task. The findings provide evidence that lower values of margin are not rigorous enough to help with the learning process, whereas larger values produce much noise pushing the entities beyond to the surface of the hyperspace, thus requiring constant regularization. Finally, the correlation between link prediction and classification accuracy shows traditional validation protocol for embedding models is a weak metric to represent the quality of embedding representation.
Massive errors found in facial recognition tech, especially in case of nonwhites: U.S. study
WASHINGTON โ Facial recognition systems can produce wildly inaccurate results, especially for nonwhites, according to a U.S. government study released Thursday that is likely to raise fresh doubts on deployment of the artificial intelligence technology. The study of dozens of facial recognition algorithms showed "false positives" rates for Asians and African-Americans as much as 100 times higher than for whites. The researchers from the National Institute of Standards and Technology (NIST), a government research center, also found two algorithms assigned the wrong gender to black females almost 35 percent of the time. The study comes amid widespread deployment of facial recognition for law enforcement, airports, border security, banking, retailing, schools and for personal technology such as unlocking smartphones. Some activists and researchers have claimed the potential for errors is too great and that mistakes could result in the jailing of innocent people, and that the technology could be used to create databases that may be hacked or inappropriately used.
Data science for cybersecurity: A probabilistic time series model for detecting RDP inbound brute force attacks - Microsoft Security
Our approach to time series anomaly detection is computationally efficient, automatically learns how to update probabilities and adapt to changes in data. As we describe in the next section, this approach has yielded successful attack detection at high precision. The proposed time series anomaly detection model was deployed and utilized by Microsoft Threat Experts to detect RDP brute force attacks during threat hunting activities. A list that ranks machines across enterprises with the lowest anomaly scores (indicating the likelihood of observing a value at least as large under expected conditions in all signals considered) is updated and reviewed every day. See Table 1 for an example.
Destruction of Image Steganography using Generative Adversarial Networks
Corley, Isaac, Lwowski, Jonathan, Hoffman, Justin
Digital image steganalysis, or the detection of image steganography, has been studied in depth for years and is driven by Advanced Persistent Threat (APT) groups', such as APT37 Reaper, utilization of steganographic techniques to transmit additional malware to perform further post-exploitation activity on a compromised host. However, many steganalysis algorithms are constrained to work with only a subset of all possible images in the wild or are known to produce a high false positive rate. This results in blocking any suspected image being an unreasonable policy. A more feasible policy is to filter suspicious images prior to reception by the host machine. However, how does one optimally filter specifically to obfuscate or remove image steganography while avoiding degradation of visual image quality in the case that detection of the image was a false positive? We propose the Deep Digital Steganography Purifier (DDSP), a Generative Adversarial Network (GAN) which is optimized to destroy steganographic content without compromising the perceptual quality of the original image. As verified by experimental results, our model is capable of providing a high rate of destruction of steganographic image content while maintaining a high visual quality in comparison to other state-of-the-art filtering methods. Additionally, we test the transfer learning capability of generalizing to to obfuscate real malware payloads embedded into different image file formats and types using an unseen steganographic algorithm and prove that our model can in fact be deployed to provide adequate results.