Performance Analysis


Detecting random filenames using (un)supervised machine learning

#artificialintelligence

Combining both n-grams and random forest models to detect malicious activity. An essential part of Managed Detection and Response at Fox-IT is the Security Operations Center. This is our frontline for detecting and analyzing possible threats. Our Security Operations Center brings together the best in human and machine analysis and we continually strive to improve both. For instance, we develop machine learning techniques for detecting malicious content such as DGA domains or unusual SMB traffic.
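
The n-gram half of that idea can be sketched in a few lines. This is a minimal illustration, not Fox-IT's implementation: the BENIGN list, the function names, and the add-one smoothing constant are all assumptions made for the example. It fits a character bigram model on known-good filenames and scores a new name by its average log-probability; names made of rarely seen character pairs score lower. In a combined setup, a score like this could be one input feature to a supervised classifier such as a random forest, which is not shown here.

```python
import math
from collections import Counter

def bigrams(name):
    # Pad with start/end markers so the first and last characters count too.
    padded = "^" + name.lower() + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def train_bigram_model(benign_names):
    # Count every character bigram seen in the known-good filenames.
    counts = Counter()
    for name in benign_names:
        counts.update(bigrams(name))
    return counts, sum(counts.values())

def avg_log_prob(name, model, vocab_size=38 * 38):
    # Average add-one-smoothed log-probability per bigram.
    # vocab_size is an assumed bound on the number of distinct bigrams.
    counts, total = model
    grams = bigrams(name)
    log_probs = [math.log((counts[g] + 1) / (total + vocab_size)) for g in grams]
    return sum(log_probs) / len(log_probs)

# Hypothetical training data: a handful of ordinary-looking filenames.
BENIGN = ["report", "setup", "readme", "install", "update",
          "notes", "server", "reader", "datastore", "printer"]
model = train_bigram_model(BENIGN)

ordinary_score = avg_log_prob("restore", model)
random_score = avg_log_prob("xqzjvkw", model)
```

Here ordinary_score comes out higher than random_score, because "restore" is built from bigrams that occur in the training names while "xqzjvkw" is not. A real deployment would need a far larger benign corpus and a calibrated cut-off.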


Regularized Adversarial Sampling and Deep Time-aware Attention for Click-Through Rate Prediction

arXiv.org Machine Learning

Improving the performance of click-through rate (CTR) prediction remains one of the core tasks in online advertising systems. With the rise of deep learning, CTR prediction models with deep networks remarkably enhance model capacities. In deep CTR models, exploiting users' historical data is essential for learning users' behaviors and interests. As existing CTR prediction works neglect the importance of temporal signals when embedding users' historical clicking records, we propose a time-aware attention model which explicitly uses absolute temporal signals for expressing the users' periodic behaviors and relative temporal signals for expressing the temporal relation between items. Besides, we propose a regularized adversarial sampling strategy for negative sampling which eases the classification imbalance of CTR data and can make use of the strong guidance provided by the observed negative CTR samples. The adversarial sampling strategy significantly improves the training efficiency, and can be co-trained with the time-aware attention model seamlessly. Experiments are conducted on real-world CTR datasets from both in-station and out-station advertising places.


A Study of Data Pre-processing Techniques for Imbalanced Biomedical Data Classification

arXiv.org Machine Learning

Biomedical data are widely accepted in developing prediction models for identifying a specific tumor, drug discovery and classification of human cancers. However, previous studies usually focused on different classifiers and overlooked the class imbalance problem in real-world biomedical datasets. There is a lack of studies evaluating data pre-processing techniques, such as resampling and feature selection, for imbalanced biomedical data learning. The relationship between data pre-processing techniques and the data distributions has never been analysed in previous studies. This article mainly focuses on reviewing and evaluating some popular and recently developed resampling and feature selection methods for class imbalance learning. We analyse the effectiveness of each technique from a data distribution perspective. Extensive experiments have been done based on five classifiers, four performance measures, and eight learning techniques across twenty real-world datasets. Experimental results show that: (1) resampling and feature selection techniques exhibit better performance with a support vector machine (SVM) classifier, but perform poorly with C4.5 decision tree and linear discriminant analysis classifiers; (2) for datasets with different distributions, techniques such as random undersampling and feature selection perform better than other data pre-processing methods on the T Location-Scale distribution when using SVM and KNN (K-nearest neighbours) classifiers, while random oversampling outperforms other methods on the Negative Binomial distribution using a Random Forest classifier at a lower imbalance ratio; (3) feature selection outperforms other data pre-processing methods in most cases; thus, feature selection with an SVM classifier is the best choice for imbalanced biomedical data learning.
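
Random undersampling, one of the resampling techniques named in the abstract, can be illustrated briefly. This is a generic sketch of the technique, not the study's experimental code: the function name, the seed argument, and the toy data are assumptions for the example. It keeps every sample of the smallest class and draws an equally sized random subset from each larger class.

```python
import random
from collections import Counter, defaultdict

def random_undersample(samples, labels, seed=0):
    # Group samples by class label.
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    # Every class is cut down to the size of the smallest one.
    n_min = min(len(xs) for xs in by_class.values())
    rng = random.Random(seed)  # seeded so the draw is repeatable
    out_x, out_y = [], []
    for y, xs in by_class.items():
        for x in rng.sample(xs, n_min):  # sampling without replacement
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

# Toy imbalanced data: 10 positive samples, 90 negative samples.
X = list(range(100))
y = [1] * 10 + [0] * 90
X_bal, y_bal = random_undersample(X, y)
class_counts = Counter(y_bal)
```

After the call, class_counts holds 10 samples for each class. The trade-off is that 80 majority-class samples are discarded, which is why the choice between undersampling, oversampling, and feature selection depends on the dataset.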



Fair Predictors under Distribution Shift

arXiv.org Machine Learning

Recent work on fair machine learning adds to a growing set of algorithmic safeguards required for deployment in high societal impact areas. A fundamental concern with model deployment is to guarantee stable performance under changes in data distribution. Extensive work in domain adaptation addresses this concern, albeit with the notion of stability limited to that of predictive performance. We provide conditions under which a stable model both in terms of prediction and fairness performance can be trained. Building on the problem setup of causal domain adaptation, we select a subset of features for training predictors with fairness constraints such that risk with respect to an unseen target data distribution is minimized. Advantages of the approach are demonstrated on synthetic datasets and on the task of diagnosing acute kidney injury in a real-world dataset under an instance of measurement policy shift and selection bias.


Improving Cross-Lingual Transfer Learning by Filtering Training Data : Alexa Blogs

#artificialintelligence

This type of cross-lingual transfer learning can make it easier to bootstrap a model in a language for which training data is scarce, by taking advantage of more abundant data in a source language. But sometimes the data in the source language is so abundant that using all of it to train a transfer model would be impractically time consuming. Moreover, linguistic differences between source and target languages mean that pruning the training data in the source language, so that its statistical patterns better match those of the target language, can actually improve the performance of the transferred model. In a paper we're presenting at this year's Conference on Empirical Methods in Natural Language Processing, we describe experiments with a new data selection technique that let us halve the amount of training data required in the source language, while actually improving a transfer model's performance in a target language. For evaluation purposes, we used two techniques to cut the source-language data set in half: one was our data selection technique, and the other was random sampling.


Optimizing portfolio value with Amazon SageMaker automatic model tuning : Amazon Web Services

#artificialintelligence

Financial institutions that extend credit face the dual tasks of evaluating the credit risk associated with each loan application and determining a threshold that defines the level of risk they are willing to take on. The evaluation of credit risk is a common application of machine learning (ML) classification models. The determination of a classification threshold, though, is often treated as a secondary concern and set in an ad hoc, unprincipled manner. As a result, institutions may be creating underperforming portfolios and leaving risk-adjusted return on the table. In this blog post, we describe how to use Amazon SageMaker automatic model tuning to determine the classification threshold that maximizes the portfolio value of a lender choosing a subset of borrowers to lend to. More generally, we describe a method of choosing an optimal threshold, or set of thresholds, in a classification setting. The method we describe doesn't rely on rules of thumb or generic metrics. It is a systematic and principled method that relies on a business success metric specific to the problem at hand. The method is based upon utility theory and the idea that a rational individual makes decisions so as to maximize her expected utility, or subjective value. In this post, we assume that the lender is attempting to maximize the expected dollar value of her portfolio by choosing a classification threshold that divides loan applications into two groups: those she accepts and lends to, and those she rejects. In other words, the lender is searching over the space of potential threshold values to find the threshold that results in the highest value for the function that describes her portfolio value.
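
The threshold search described above can be sketched on toy data. This is a simplified illustration under stated assumptions, not the blog post's SageMaker workflow: it swaps automatic model tuning for a plain grid search, and the gain and loss amounts, the function names, and the six example loans are all hypothetical. A loan is accepted when its predicted default probability is below the threshold; a repaid loan adds a fixed gain and a defaulted loan subtracts a fixed loss.

```python
def portfolio_value(threshold, default_probs, defaulted, gain=1000.0, loss=5000.0):
    # Sum the dollar outcome of every loan the lender would accept.
    value = 0.0
    for p, d in zip(default_probs, defaulted):
        if p < threshold:  # accept only loans scored below the threshold
            value += -loss if d else gain
    return value

def best_threshold(default_probs, defaulted, candidates):
    # Grid search: return the candidate with the highest portfolio value.
    return max(candidates, key=lambda t: portfolio_value(t, default_probs, defaulted))

# Hypothetical scored loans: predicted default probability and actual outcome.
probs = [0.05, 0.10, 0.20, 0.35, 0.60, 0.80]
outcomes = [False, False, False, True, False, True]
candidates = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9

chosen = best_threshold(probs, outcomes, candidates)
chosen_value = portfolio_value(chosen, probs, outcomes)
```

On this data the search selects a threshold of 0.3 with a portfolio value of 3000.0: raising the threshold to 0.4 admits the loan scored 0.35, whose default costs more than the three good loans earn. The objective is the business metric itself rather than a generic metric such as accuracy, which is the point the post makes.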


Machine learning Training in Hyderabad

#artificialintelligence

Build a strong, knowledge-based career foundation in Machine Learning, a leading analytics platform, with Analytics Path's top-notch Machine Learning Training in Hyderabad. Our expert trainers will work towards transforming our students into complete, career-ready professionals. By the time the course is complete, our students will be fully capable of handling the complex real-world challenges of the Machine Learning domain. Students will gain expertise in advanced concepts like Support Vector Machines, Naive Bayes Classification, Logistic Regression, Decision Tree Algorithms, K-Means Clustering and more. Machine Learning is the most challenging and innovative platform in the present-day analytics domain.


Randomization as Regularization: A Degrees of Freedom Explanation for Random Forest Success

arXiv.org Machine Learning

Random forests remain among the most popular off-the-shelf supervised machine learning tools with a well-established track record of predictive accuracy in both regression and classification settings. Despite their empirical success as well as a bevy of recent work investigating their statistical properties, a full and satisfying explanation for their success has yet to be put forth. Here we aim to take a step forward in this direction by demonstrating that the additional randomness injected into individual trees serves as a form of implicit regularization, making random forests an ideal model in low signal-to-noise ratio (SNR) settings. Specifically, from a model-complexity perspective, we show that the mtry parameter in random forests serves much the same purpose as the shrinkage penalty in explicitly regularized regression procedures like lasso and ridge regression. To highlight this point, we design a randomized linear-model-based forward selection procedure intended as an analogue to tree-based random forests and demonstrate its surprisingly strong empirical performance. Numerous demonstrations on both real and synthetic data are provided.


Scaling structural learning with NO-BEARS to infer causal transcriptome networks

arXiv.org Machine Learning

Constructing gene regulatory networks is a critical step in revealing disease mechanisms from transcriptomic data. In this work, we present NO-BEARS, a novel algorithm for estimating gene regulatory networks. The NO-BEARS algorithm is built on the basis of the NOTEARS algorithm with two improvements. First, we propose a new constraint and its fast approximation to reduce the computational cost of the NOTEARS algorithm. Next, we introduce a polynomial regression loss to handle non-linearity in gene expression. Our implementation utilizes modern GPU computation that can reduce hours-long CPU computation to seconds. Using synthetic data, we demonstrate improved performance, both in processing time and accuracy, on inferring gene regulatory networks from gene expression data.