Accuracy
Open SESAME: Fighting Botnets with Seed Reconstructions of Domain Generation Algorithms
Weissgerber, Nils, Jenke, Thorsten, Padilla, Elmar, Bruckschen, Lilli
An important aspect of many botnets is their capability to generate pseudorandom domain names using Domain Generation Algorithms (DGAs). A cyber criminal can register such domains to establish periodically changing rendezvous points with the bots. DGAs make use of seeds to generate sets of domains. Seeds can easily be changed in order to generate entirely new groups of domains while using the same underlying algorithm. While this requires very little manual effort for an adversary, security specialists typically have to manually reverse engineer new malware strains to reconstruct the seeds. Only when the seed and DGA are known, past and future domains can be generated, efficiently attributed, blocked, sinkholed or used for a take-down. Common counters in the literature consist of databases or Machine Learning (ML) based detectors to keep track of past and future domains of known DGAs and to identify DGA-generated domain names, respectively. However, database based approaches can not detect domains generated by new DGAs, and ML approaches can not generate future domain names. In this paper, we introduce SESAME, a system that combines the two above-mentioned approaches and contains a module for automatic Seed Reconstruction, which is, to our knowledge, the first of its kind. It is used to automatically classify domain names, rate their novelty, and determine the seeds of the underlying DGAs. SESAME consists of multiple DGA-specific Seed Reconstructors and is designed to work purely based on domain names, as they are easily obtainable from observing the network traffic. We evaluated our approach on 20.8 gigabytes of DNS-lookups. Thereby, we identified 17 DGAs, of which 4 were entirely new to us.
Sharpening Ponzi Schemes Detection on Ethereum with Machine Learning
Galletta, Letterio, Pinelli, Fabio
Blockchain technology has been successfully exploited for deploying new economic applications. However, it has started arousing the interest of malicious users who deliver scams to deceive honest users and to gain economic advantages. Among the various scams, Ponzi schemes are one of the most common. Here, we present an automatic technique for detecting smart Ponzi contracts on Ethereum. We release a reusable data set with 4422 unique real-world smart contracts. Then, we introduce a new set of features that allow us to improve the classification. Finally, we identify a small and effective set of features that ensures a good classification quality.
NRBdMF: A recommendation algorithm for predicting drug effects considering directionality
Azuma, Iori, Mizuno, Tadahaya, Kusuhara, Hiroyuki
Predicting the novel effects of drugs based on information about approved drugs can be regarded as a recommendation system. Matrix factorization is one of the most used recommendation systems and various algorithms have been devised for it. A literature survey and summary of existing algorithms for predicting drug effects demonstrated that most such methods, including neighborhood regularized logistic matrix factorization, which was the best performer in benchmark tests, used a binary matrix that considers only the presence or absence of interactions. However, drug effects are known to have two opposite aspects, such as side effects and therapeutic effects. In the present study, we proposed using neighborhood regularized bidirectional matrix factorization (NRBdMF) to predict drug effects by incorporating bidirectionality, which is a characteristic property of drug effects. We used this proposed method for predicting side effects using a matrix that considered the bidirectionality of drug effects, in which known side effects were assigned a positive label (plus 1) and known treatment effects were assigned a negative (minus 1) label. The NRBdMF model, which utilizes drug bidirectional information, achieved enrichment of side effects at the top and indications at the bottom of the prediction list. This first attempt to consider the bidirectional nature of drug effects using NRBdMF showed that it reduced false positives and produced a highly interpretable output.
Boosting Out-of-Distribution Detection with Multiple Pre-trained Models
Xue, Feng, He, Zi, Xie, Chuanlong, Tan, Falong, Li, Zhenguo
Out-of-Distribution (OOD) detection, i.e., identifying whether an input is sampled from a novel distribution other than the training distribution, is a critical task for safely deploying machine learning systems in the open world. Recently, post hoc detection utilizing pre-trained models has shown promising performance and can be scaled to large-scale problems. This advance raises a natural question: Can we leverage the diversity of multiple pre-trained models to improve the performance of post hoc detection methods? In this work, we propose a detection enhancement method by ensembling multiple detection decisions derived from a zoo of pre-trained models. Our approach uses the p-value instead of the commonly used hard threshold and leverages a fundamental framework of multiple hypothesis testing to control the true positive rate of In-Distribution (ID) data. We focus on the usage of model zoos and provide systematic empirical comparisons with current state-of-the-art methods on various OOD detection benchmarks. The proposed ensemble scheme shows consistent improvement compared to single-model detectors and significantly outperforms the current competitive methods. Our method substantially improves the relative performance by 65.40% and 26.96% on the CIFAR10 and ImageNet benchmarks.
Learning Invariant Rules from Data for Interpretable Anomaly Detection
In the research area of anomaly detection, novel and promising methods are frequently developed. However, most existing studies exclusively focus on the detection task only and ignore the interpretability of the underlying models as well as their detection results. Nevertheless, anomaly interpretation, which aims to provide explanation of why specific data instances are identified as anomalies, is an equally important task in many real-world applications. In this work, we propose a novel framework which synergizes several machine learning and data mining techniques to automatically learn invariant rules that are consistently satisfied in a given dataset. The learned invariant rules can provide explicit explanation of anomaly detection results in the inference phase and thus are extremely useful for subsequent decision-making regarding reported anomalies. Furthermore, our empirical evaluation shows that the proposed method can also achieve comparable or even better performance in terms of AUC and partial AUC on public benchmark datasets across various application domains compared with start-of-the-art anomaly detection models.
Enabling Trade-offs in Privacy and Utility in Genomic Data Beacons and Summary Statistics
Venkatesaramani, Rajagopal, Wan, Zhiyu, Malin, Bradley A., Vorobeychik, Yevgeniy
The collection and sharing of genomic data are becoming increasingly commonplace in research, clinical, and direct-to-consumer settings. The computational protocols typically adopted to protect individual privacy include sharing summary statistics, such as allele frequencies, or limiting query responses to the presence/absence of alleles of interest using web-services called Beacons. However, even such limited releases are susceptible to likelihood-ratio-based membership-inference attacks. Several approaches have been proposed to preserve privacy, which either suppress a subset of genomic variants or modify query responses for specific variants (e.g., adding noise, as in differential privacy). However, many of these approaches result in a significant utility loss, either suppressing many variants or adding a substantial amount of noise. In this paper, we introduce optimization-based approaches to explicitly trade off the utility of summary data or Beacon responses and privacy with respect to membership-inference attacks based on likelihood-ratios, combining variant suppression and modification. We consider two attack models. In the first, an attacker applies a likelihood-ratio test to make membership-inference claims. In the second model, an attacker uses a threshold that accounts for the effect of the data release on the separation in scores between individuals in the dataset and those who are not. We further introduce highly scalable approaches for approximately solving the privacy-utility tradeoff problem when information is either in the form of summary statistics or presence/absence queries. Finally, we show that the proposed approaches outperform the state of the art in both utility and privacy through an extensive evaluation with public datasets.
Factors other than climate change are currently more important in predicting how well fruit farms are doing financially
Obster, Fabian, Bohle, Heidi, Pechan, Paul M.
Machine learning and statistical modeling methods were used to analyze the impact of climate change on financial wellbeing of fruit farmers in Tunisia and Chile. The analysis was based on face to face interviews with 801 farmers. Three research questions were investigated. First, whether climate change impacts had an effect on how well the farm was doing financially. Second, if climate change was not influential, what factors were important for predicting financial wellbeing of the farm. And third, ascertain whether observed effects on the financial wellbeing of the farm were a result of interactions between predictor variables. This is the first report directly comparing climate change with other factors potentially impacting financial wellbeing of farms. Certain climate change factors, namely increases in temperature and reductions in precipitation, can regionally impact self-perceived financial wellbeing of fruit farmers. Specifically, increases in temperature and reduction in precipitation can have a measurable negative impact on the financial wellbeing of farms in Chile. This effect is less pronounced in Tunisia. Climate impact differences were observed within Chile but not in Tunisia. However, climate change is only of minor importance for predicting farm financial wellbeing, especially for farms already doing financially well. Factors that are more important, mainly in Tunisia, included trust in information sources and prior farm ownership. Other important factors include farm size, water management systems used and diversity of fruit crops grown. Moreover, some of the important factors identified differed between farms doing and not doing well financially. Interactions between factors may improve or worsen farm financial wellbeing.
A Stochastic Optimization Framework for Fair Risk Minimization
Lowy, Andrew, Baharlouei, Sina, Pavan, Rakesh, Razaviyayn, Meisam, Beirami, Ahmad
Despite the success of large-scale empirical risk minimization (ERM) at achieving high accuracy across a variety of machine learning tasks, fair ERM is hindered by the incompatibility of fairness constraints with stochastic optimization. We consider the problem of fair classification with discrete sensitive attributes and potentially large models and data sets, requiring stochastic solvers. Existing in-processing fairness algorithms are either impractical in the large-scale setting because they require large batches of data at each iteration or they are not guaranteed to converge. In this paper, we develop the first stochastic in-processing fairness algorithm with guaranteed convergence. For demographic parity, equalized odds, and equal opportunity notions of fairness, we provide slight variations of our algorithm--called FERMI--and prove that each of these variations converges in stochastic optimization with any batch size. Empirically, we show that FERMI is amenable to stochastic solvers with multiple (non-binary) sensitive attributes and non-binary targets, performing well even with minibatch size as small as one. Extensive experiments show that FERMI achieves the most favorable tradeoffs between fairness violation and test accuracy across all tested setups compared with state-of-the-art baselines for demographic parity, equalized odds, equal opportunity. These benefits are especially significant with small batch sizes and for non-binary classification with large number of sensitive attributes, making FERMI a practical, scalable fairness algorithm. The code for all of the experiments in this paper is available at: https://github.com/optimization-for-data-driven-science/FERMI.
Contrastive Neural Ratio Estimation
Miller, Benjamin Kurt, Weniger, Christoph, Forré, Patrick
Likelihood-to-evidence ratio estimation is usually cast as either a binary (NRE-A) or a multiclass (NRE-B) classification task. In contrast to the binary classification framework, the current formulation of the multiclass version has an intrinsic and unknown bias term, making otherwise informative diagnostics unreliable. We propose a multiclass framework free from the bias inherent to NRE-B at optimum, leaving us in the position to run diagnostics that practitioners depend on. It also recovers NRE-A in one corner case and NRE-B in the limiting case. For fair comparison, we benchmark the behavior of all algorithms in both familiar and novel training regimes: when jointly drawn data is unlimited, when data is fixed but prior draws are unlimited, and in the commonplace fixed data and parameters setting. Our investigations reveal that the highest performing models are distant from the competitors (NRE-A, NRE-B) in hyperparameter space. We make a recommendation for hyperparameters distinct from the previous models. We suggest a bound on the mutual information as a performance metric for simulation-based inference methods, without the need for posterior samples, and provide experimental results.
Combining Self-labeling with Selective Sampling
Kozal, Jędrzej, Woźniak, Michał
Since data is the fuel that drives machine learning models, and access to labeled data is generally expensive, semi-supervised methods are constantly popular. They enable the acquisition of large datasets without the need for too many expert labels. This work combines self-labeling techniques with active learning in a selective sampling scenario. We propose a new method that builds an ensemble classifier. Based on an evaluation of the inconsistency of the decisions of the individual base classifiers for a given observation, a decision is made on whether to request a new label or use the self-labeling. In preliminary studies, we show that naive application of self-labeling can harm performance by introducing bias towards selected classes and consequently lead to skewed class distribution. Hence, we also propose mechanisms to reduce this phenomenon. Experimental evaluation shows that the proposed method matches current selective sampling methods or achieves better results.