 Performance Analysis


Grouping the executables to detect malware with high accuracy

arXiv.org Artificial Intelligence

Metamorphic malware variants with the same malicious behavior (family) can obfuscate themselves to look different from each other. This variation in structure leads to a huge signature database for traditional signature matching techniques to detect them. For effective and efficient detection of malware in large numbers of executables, we need to partition these files into groups that can identify their respective families. In addition, the grouping criteria should be chosen in such a way that they can also be applied to classify unknown files encountered on computers. This paper discusses the study of malware and benign executables in groups to detect unknown malware with high accuracy. We studied the sizes of malware generated by three popular second-generation malware (metamorphic malware) creator kits, viz. G2, PS-MPC and NGVCK, and observed that the size variation between any two malware samples generated by the same kit is small. Hence, we grouped the executables on the basis of malware sizes by using the Optimal k-Means Clustering algorithm and used the resulting groups to select promising features for training classifiers (Random forest, J48, LMT, FT and NBT) to detect variants of malware or unknown malware. We find that detecting malware on the basis of their respective file sizes gives accuracy of up to 99.11% from the classifiers.
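
The size-based grouping described in this abstract can be illustrated with a minimal sketch. It uses plain Lloyd-style k-means on scalar file sizes, which is an assumption: the paper's "Optimal k-Means Clustering" variant is not specified here, and the function names `nearest` and `kmeans_sizes`, the quantile-based initialization, and the sample sizes are all hypothetical.

```python
def nearest(size, centers):
    """Return the index of the center closest to a file size."""
    return min(range(len(centers)), key=lambda i: abs(size - centers[i]))


def kmeans_sizes(sizes, k, max_iter=100):
    """Plain k-means on scalar file sizes (illustrative, not the paper's variant)."""
    ordered = sorted(sizes)
    n = len(ordered)
    # Assumption: deterministic start from evenly spaced order statistics.
    centers = [float(ordered[(2 * i + 1) * n // (2 * k)]) for i in range(k)]
    for _ in range(max_iter):
        labels = [nearest(s, centers) for s in sizes]
        updated = []
        for i in range(k):
            members = [s for s, lab in zip(sizes, labels) if lab == i]
            # Keep the old center if a group ends up empty.
            updated.append(sum(members) / len(members) if members else centers[i])
        if updated == centers:
            break
        centers = updated
    # Final assignment against the converged centers.
    labels = [nearest(s, centers) for s in sizes]
    return centers, labels


# Hypothetical file sizes in bytes, loosely imitating three generator kits.
sample_sizes = [1000, 1010, 1020, 5000, 5050, 5100, 20000, 20100]
centers, labels = kmeans_sizes(sample_sizes, k=3)
# An unseen file is routed to the group whose center is closest to its size.
group_for_new_file = nearest(1005, centers)
```

In a pipeline like the one the abstract outlines, each resulting group would then get its own feature selection and classifier; that step is omitted here because the source gives no details about it.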


AI Boosts Cancer Screens to Nearly 100 Percent Accuracy

#artificialintelligence

Diagnosing cancer is about to get more accurate, with the help of artificial intelligence. Pathologists have diagnosed diseases in more or less the same way for the past 100 years, by laboring over a microscope reviewing biopsy samples on little glass slides. Working almost robotically, they sift through millions of normal cells to identify just a few diseased ones. The task is tedious and prone to human error. But now, scientists and engineers have created a technique that uses artificial intelligence (AI) and can differentiate cancer cells from normal cells almost as well as a top-notch pathologist.


Risk-consistency of cross-validation with lasso-type procedures

arXiv.org Machine Learning

The lasso and related sparsity inducing algorithms have been the target of substantial theoretical and applied research. Correspondingly, many results are known about their behavior for a fixed or optimally chosen tuning parameter specified up to unknown constants. In practice, however, this oracle tuning parameter is inaccessible so one must use the data to select one. Common statistical practice is to use a variant of cross-validation for this task. However, little is known about the theoretical properties of the resulting predictions with such data-dependent methods. We consider the high-dimensional setting with random design wherein the number of predictors $p$ grows with the number of observations $n$. Under typical assumptions on the data generating process, similar to those in the literature, we recover oracle rates up to a log factor when choosing the tuning parameter with cross-validation. Under weaker conditions, when the true model is not necessarily linear, we show that the lasso remains risk consistent relative to its linear oracle. We also generalize these results to the group lasso and square-root lasso and investigate the predictive and model selection performance of cross-validation via simulation.
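
For orientation, the objects this abstract refers to can be written down in their common textbook form. The $1/(2n)$ scaling, the candidate set $\Lambda$, and the fold notation $V_k$ are assumptions made for illustration and may differ from the paper's own conventions. The lasso estimator for a tuning parameter $\lambda$ and its $K$-fold cross-validated choice are

```latex
$$
\hat{\beta}(\lambda) = \arg\min_{\beta \in \mathbb{R}^p}
  \frac{1}{2n} \lVert Y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 ,
\qquad
\hat{\lambda}_{\mathrm{CV}} = \arg\min_{\lambda \in \Lambda}
  \frac{1}{K} \sum_{k=1}^{K} \frac{1}{\lvert V_k \rvert}
  \sum_{i \in V_k} \bigl( Y_i - X_i^{\top} \hat{\beta}^{(-k)}(\lambda) \bigr)^2 ,
$$
```

where $V_1, \dots, V_K$ partition the $n$ observations into validation folds and $\hat{\beta}^{(-k)}(\lambda)$ is the lasso fit computed without fold $V_k$. The question of risk consistency is then whether predictions from $\hat{\beta}(\hat{\lambda}_{\mathrm{CV}})$ perform nearly as well as those obtained with the inaccessible oracle tuning parameter.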


Entry Point Data - Using Python's Sci-packages to Prepare Data for Machine Learning Tasks and other

#artificialintelligence

In this short tutorial I want to provide a short overview of some of my favorite Python tools for common procedures as entry points for general pattern classification and machine learning tasks, and various other data analyses. In this section I want to recommend a way of installing the required Python packages if you have not done so yet. Otherwise you can skip this part. Although they can be installed step-by-step "manually", I highly recommend that you take a look at the Anaconda Python distribution for scientific computing. Anaconda is distributed by Continuum Analytics, but it is completely free and includes more than 195 packages for science and data analysis as of today.


AI creates efficiencies in sanctions checking @Euromoney

#artificialintelligence

In transaction banking, the focus on technological development has centred on the possibilities of blockchain technology. However, this has overshadowed the arrival of AI into transaction-banking platforms. AI and machine learning are helping to further reduce manual checks and processes. The first target for implementation is sanctions and compliance. As companies become increasingly international, irrespective of size, checking against sanctions has become an essential activity for more than just the MNCs. AI can learn through experience what can pass through the sanctions filter, and what compliance obligations need to be checked.


Huge US facial recognition database flawed: audit

Daily Mail - Science & tech

The FBI's facial recognition database has more than 400 million pictures to help its criminal investigations, but lacks adequate safeguards for accuracy and privacy protection, a congressional audit has revealed. The database totals 411.9 million images, and privacy campaigners have slammed the 'unprecedented number of photographs, most of which are of Americans and foreigners who have committed no crimes.' The huge database - which enables investigators to automatically search images for criminal suspects - 'is far greater than had previously been understood' and raises concerns 'about the risk of innocent Americans being inadvertently swept up in criminal investigations,' said Senator Al Franken, who requested the study. The FBI's database includes some 30 million criminal mugshots and 140 million images from visa applications by foreign nationals, the GAO found. It also contains drivers' license pictures from 16 US states and 6.7 million photos from the Defense Department's biometric identification system of individuals detained by US forces abroad, among others.


Cross-validation in R: a do-it-yourself and a black box approach

@machinelearnbot

In my previous post, we saw that R-squared can lead to a misleading interpretation of the quality of our regression fit, in terms of prediction power. One thing that R-squared offers no protection against is overfitting. On the other hand, cross-validation, by allowing us to have cases in our testing set that are different from the cases in our training set, inherently offers protection against overfitting. In leave-one-out cross-validation, one case in our data set is used as the test set, while the remaining cases are used as the training set. We iterate through the data set, until all cases have served as the test set.
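
The leave-one-out procedure described above can be sketched in a few lines. The post itself works in R; the version below is written in Python for illustration only. The helper names `fit_line` and `loocv_mse`, the choice of a one-predictor least-squares line, and the sample data are all assumptions, not the post's code.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def loocv_mse(xs, ys):
    """Mean squared prediction error under leave-one-out cross-validation."""
    squared_errors = []
    for i in range(len(xs)):
        # Hold out case i as the test set and train on all remaining cases.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        intercept, slope = fit_line(train_x, train_y)
        prediction = intercept + slope * xs[i]
        squared_errors.append((ys[i] - prediction) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Hypothetical data: the last case departs from the linear trend.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 2.0, 10.0]
error_estimate = loocv_mse(xs, ys)
```

Because every prediction is made for a case the model never saw during fitting, the resulting error reflects out-of-sample performance, which is the protection against overfitting that R-squared lacks.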


ACDC: $\alpha$-Carving Decision Chain for Risk Stratification

arXiv.org Machine Learning

In many healthcare settings, intuitive decision rules for risk stratification can support effective hospital resource allocation. This paper introduces a novel variant of decision tree algorithms that produces a chain of decisions, not a general tree. Our algorithm, $\alpha$-Carving Decision Chain (ACDC), sequentially carves out "pure" subsets of the majority class examples. The resulting chain of decision rules yields a pure subset of the minority class examples. Our approach is particularly effective in exploring large and class-imbalanced health datasets. Moreover, ACDC provides an interactive interpretation in conjunction with visual performance metrics such as the Receiver Operating Characteristic curve and Lift chart.


No penalty no tears: Least squares in high-dimensional linear models

arXiv.org Machine Learning

Ordinary least squares (OLS) is the default method for fitting linear models, but is not applicable for problems with dimensionality larger than the sample size. For these problems, we advocate the use of a generalized version of OLS motivated by ridge regression, and propose two novel three-step algorithms involving least squares fitting and hard thresholding. The algorithms are methodologically simple to understand intuitively, computationally easy to implement efficiently, and theoretically appealing for choosing models consistently. Numerical exercises comparing our methods with penalization-based approaches in simulations and data analyses illustrate the great potential of the proposed algorithms.


Invincea First Machine Learning Based Endpoint Security Company to Join Anti-Malware Testing Standards Organization (AMTSO(TM))

#artificialintelligence

FAIRFAX, VA--(Marketwired - June 15, 2016) - Invincea, the leader in advanced endpoint threat protection, announced today that it is the first machine learning based endpoint security company to join the Anti-Malware Testing Standards Organization (AMTSO). Participation in AMTSO furthers Invincea's mission of addressing the global need for improvement in third party testing based on scientific objectivity, quality, and relevance of anti-malware testing methodologies. Hundreds of millions of new pieces of malware are created each year, wreaking havoc on enterprises across industries against the backdrop of obsolete anti-malware approaches. To combat the scourge of malware that evades traditional anti-malware systems, the next-gen endpoint security market has exploded with new companies bringing products to market with fantastic claims. To date, these companies have not been held accountable for their marketing claims by independent, scientifically valid testing on the merits of their product technology and approaches.