 Learning Graphical Models


Polymorphic Malware Detection Using Sequence Classification Methods

@machinelearnbot

Polymorphic malware detection is challenging due to the continual mutations miscreants introduce to successive instances of a particular virus. Such changes are akin to mutations in biological sequences. Recently, high-throughput methods for gene sequence classification have been developed by the bioinformatics and computational biology communities. In this paper, we argue that these methods can be usefully applied to malware detection. Unfortunately, gene classification tools are usually optimized for and restricted to an alphabet of four letters (nucleic acids). Consequently, we have selected the Strand gene sequence classifier, which offers a robust classification strategy that can easily accommodate unstructured data with any alphabet, including source code or compiled machine code. To demonstrate Strand's suitability for classifying malware, we execute it on approximately 500 GB of malware data provided by the Kaggle Microsoft Malware Classification Challenge (BIG 2015), used for predicting 9 classes of polymorphic malware. Experiments show that, with minimal adaptation, the method achieves accuracy levels well above 95% while requiring only a fraction of the training time used by the winning team's method.
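
To make the idea of sequence classification over an arbitrary alphabet concrete, here is a minimal sketch in Python. It is a generic illustration and not Strand's actual algorithm, which the abstract does not describe: it assumes each class is represented by one reference byte string, compares overlapping byte k-grams with Jaccard similarity, and picks the most similar class. The function names, the value of `k`, and the family labels are all hypothetical.

```python
def kgrams(data: bytes, k: int = 4) -> set:
    """Return the set of overlapping k-byte substrings of data."""
    return {data[i:i + k] for i in range(len(data) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify(sample: bytes, references: dict, k: int = 4) -> str:
    """Return the label whose reference sequence shares the most k-grams with sample."""
    grams = kgrams(sample, k)
    return max(references, key=lambda label: jaccard(grams, kgrams(references[label], k)))


# Hypothetical reference sequences, one per malware family (made-up bytes).
references = {
    "family_a": b"\x90\x90\x31\xc0\x50\x68\x2f\x2f",
    "family_b": b"\x55\x89\xe5\x83\xec\x10\xc7\x45",
}

# A "mutated" variant of family_a: one byte dropped at the front, one appended.
variant = b"\x90\x31\xc0\x50\x68\x2f\x2f\x73"
predicted = classify(variant, references)
```

Because the comparison works on raw bytes, the same code applies to any alphabet, which is the property the abstract highlights.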


User Model-Based Intent-Aware Metrics for Multilingual Search Evaluation

arXiv.org Machine Learning

Despite the growing importance of the multilingual aspect of web search, no appropriate offline metrics to evaluate its quality have been proposed so far. At the same time, personal language preferences can be regarded as intents of a query. This approach translates the multilingual search problem into a particular task of search diversification. Furthermore, the standard intent-aware approach could be adopted to build a diversified metric for multilingual search on the basis of a classical IR metric such as ERR. The intent-aware approach estimates user satisfaction under a user behavior model. We show, however, that the underlying user behavior model is not realistic in the multilingual case, and the produced intent-aware metric does not appropriately estimate user satisfaction. We develop a novel approach to build intent-aware user behavior models, which overcome these limitations and convert to quality metrics that better correlate with standard online metrics of user satisfaction.
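
For readers unfamiliar with the baseline, the sketch below shows the standard intent-aware construction on top of ERR (Expected Reciprocal Rank), treating each language as an intent. This is the conventional approach the abstract argues is unrealistic, not the authors' new metric. The grade scale (0 to 4), the language labels, and the intent probabilities are assumptions chosen for illustration.

```python
def err(grades, max_grade=4):
    """Expected Reciprocal Rank under the cascade model.

    Each graded result satisfies the user with probability
    (2**grade - 1) / 2**max_grade; the user stops at the first
    satisfying result, and a stop at rank r contributes 1/r.
    """
    p_reach = 1.0  # probability the user gets as far as the current rank
    score = 0.0
    for rank, grade in enumerate(grades, start=1):
        p_satisfied = (2 ** grade - 1) / 2 ** max_grade
        score += p_reach * p_satisfied / rank
        p_reach *= 1 - p_satisfied
    return score


def err_ia(grades_by_intent, intent_probs, max_grade=4):
    """Intent-aware ERR: the ERR for each intent, weighted by the intent's probability."""
    return sum(
        prob * err(grades_by_intent[intent], max_grade)
        for intent, prob in intent_probs.items()
    )


# Hypothetical example: one result list, graded separately for users
# who prefer English and users who prefer Russian.
grades_by_language = {"en": [4, 0], "ru": [0, 4]}
language_probs = {"en": 0.5, "ru": 0.5}
score = err_ia(grades_by_language, language_probs)
```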


Inferring object rankings based on noisy pairwise comparisons from multiple annotators

arXiv.org Machine Learning

Ranking a set of objects involves establishing an order allowing for comparisons between any pair of objects in the set. Oftentimes, due to the unavailability of a ground truth of ranked orders, researchers resort to obtaining judgments from multiple annotators followed by inferring the ground truth based on the collective knowledge of the crowd. However, the aggregation is often ad-hoc and involves imposing stringent assumptions in inferring the ground truth (e.g. majority vote). In this work, we propose Expectation-Maximization (EM) based algorithms that rely on the judgments from multiple annotators and the object attributes for inferring the latent ground truth. The algorithm learns the relation between the latent ground truth and object attributes as well as annotator specific probabilities of flipping, a metric to assess annotator quality. We further extend the EM algorithm to allow for a variable probability of flipping based on the pair of objects at hand. We test our algorithms on two data sets with synthetic annotations and investigate the impact of annotator quality and quantity on the inferred ground truth. We also obtain the results on two other data sets with annotations from machine/human annotators and interpret the output trends based on the data characteristics.
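
As a rough illustration of the flip-probability idea, the sketch below computes the posterior probability that one object truly outranks another, given several annotators' votes. It corresponds loosely to an E-step for a single pair and is not the authors' algorithm: it assumes the flip probabilities are already known and that annotators err independently, and it takes the prior as a plain number, whereas the paper learns it from object attributes. All names and values are hypothetical.

```python
def posterior_a_over_b(prior, votes, flip_probs):
    """Posterior probability that the true order is a > b.

    prior      -- prior probability that a > b
    votes      -- one entry per annotator: 1 if they said a > b, else 0
    flip_probs -- one entry per annotator: probability they report the wrong order
    """
    like_true = prior         # joint probability of the votes and a > b
    like_false = 1.0 - prior  # joint probability of the votes and b > a
    for vote, flip in zip(votes, flip_probs):
        if vote == 1:
            like_true *= 1.0 - flip  # annotator reported the true order
            like_false *= flip       # annotator flipped the true order
        else:
            like_true *= flip
            like_false *= 1.0 - flip
    return like_true / (like_true + like_false)


# Three equally reliable annotators; two say a > b, one disagrees.
p_equal = posterior_a_over_b(0.5, [1, 1, 0], [0.1, 0.1, 0.1])

# Same votes, but the dissenting annotator is much more reliable than the others.
p_weighted = posterior_a_over_b(0.5, [1, 1, 0], [0.4, 0.4, 0.05])
```

With equal flip probabilities the posterior follows the majority. When the dissenter is far more reliable than the other two, the posterior falls below 0.5, which is where this model departs from a simple majority vote.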


Towards Adaptive Training of Agent-based Sparring Partners for Fighter Pilots

arXiv.org Machine Learning

A key requirement for the current generation of artificial decision-makers is that they should adapt well to changes in unexpected situations. This paper addresses the situation in which an AI for aerial dogfighting, with tunable parameters that govern its behavior, must optimize behavior with respect to an objective function that is evaluated and learned through simulations. Bayesian optimization with a Gaussian Process surrogate is used as the method for investigating the objective function. One key benefit is that during optimization, the Gaussian Process learns a global estimate of the true objective function, with predicted outcomes and a statistical measure of confidence in areas that haven't been investigated yet. Having a model of the objective function is important for being able to understand possible outcomes in the decision space; for example, this is crucial for training and providing feedback to human pilots. However, standard Bayesian optimization does not perform consistently or provide an accurate Gaussian Process surrogate function for highly volatile objective functions. We treat these problems by introducing a novel sampling technique called Hybrid Repeat/Multi-point Sampling. This technique gives the AI the ability to learn optimum behaviors in a highly uncertain environment. More importantly, it not only improves the reliability of the optimization, but also creates a better model of the entire objective surface. With this improved model, the agent is equipped to predict performance in unexplored scenarios more accurately and efficiently.


The Impact Of Google RankBrain on Digital Marketing

#artificialintelligence

The secret to Google Brain and the RankBrain algorithm revealed. This article gives a historical overview of Google Brain and analyses the pattern, then draws conclusions about the current situation and future changes in search engine algorithms. Back in 2006 there was some interest in implementing artificial intelligence in Google's search engine algorithm. Google Brain was established in 2011, and in 2014 Google acquired DeepMind, a British artificial intelligence company founded in 2010. DeepMind worked on learning to play video games using machine learning and artificial neural networks (ANNs).


Data Science: Supervised Machine Learning in Python

#artificialintelligence

In recent years, we've seen a resurgence in AI, or artificial intelligence, and machine learning. Machine learning has led to some amazing results, like being able to analyze medical images and predict diseases on par with human experts. Google's AlphaGo program was able to beat a world champion in the strategy game Go using deep reinforcement learning. Machine learning is even being used to program self-driving cars, which is going to change the automotive industry forever. Imagine a world with drastically reduced car accidents, simply by removing the element of human error.


Generating Music using Markov Chains

#artificialintelligence

In a nutshell, Markov chains are mathematical systems that track the probabilities of state transitions. They're often used to model complex systems and predict behavior. They're used in a lot of commercial applications, from text autocomplete to Google's PageRank algorithm. My first encounter with a Markov chain was actually in my high school software development class, when a classmate built a chat bot using this concept. He took the log from our class Slack chat and fed it into a Markov chain.
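
To show what feeding text into a Markov chain might look like, here is a minimal first-order sketch in Python. It is an illustration under simple assumptions, not the classmate's chat bot or the article's music generator: each state is a single word, and the next word is drawn from the words observed after the current one. The sample log and function names are made up.

```python
import random
from collections import defaultdict


def build_chain(words):
    """Map each word to the list of words observed immediately after it.

    Repeated followers are kept, so a uniform random choice from the list
    reproduces the observed transition frequencies.
    """
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, length, seed=None):
    """Walk the chain from start for up to length words."""
    rng = random.Random(seed)
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:  # dead end: this word was never followed by anything
            break
        word = rng.choice(followers)
        output.append(word)
    return output


# Hypothetical stand-in for a chat log.
log = "the cat sat on the mat and the cat ran".split()
chain = build_chain(log)
sample = generate(chain, "the", 8, seed=0)
print(" ".join(sample))
```

The same structure carries over to music if the states are notes or chords instead of words.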


Data Scientist - Machine Learning @ Booking.com

#artificialintelligence

Would you like to translate terabytes of data into unforgettable holidays for millions of people around the globe? Booking.com, the world's largest accommodation booking website, is looking for rock star Data Scientists to join our highly successful Personalization Team within the Front End department. This product development team crunches endless amounts of data to provide our customers with the best possible experience. They focus on anything from understanding and predicting market data, to ranking all properties on our website, and providing our customers with the most relevant personalized recommendations. As a Data Scientist you'll work side by side with Developers, Designers and Product Owners, and take full ownership of your work – from the initial idea-generation phase to the implementation of the final product on our website. Our ideal candidate is result-focused, innovative and has a solid quantitative background and a good business understanding.


Searching for the Master Algorithm - New Signature

#artificialintelligence

It may sound trite, but humanity has come to dominate the world using this tool alone. Humans lack natural weapons, have no natural protection from the elements, and enter life as helpless infants. But our unique brains allow us to acquire, use, and communicate knowledge, and this advantage alone has allowed us to create the intricate social and technological reality we now inhabit. Our brains evolved to process, store, retrieve, and integrate sensory data into working knowledge that allows us to navigate reality. Until recently, humans were the only significant force that could translate raw data into accurate, actionable knowledge.


Bayes Theorem: A Visual Introduction For Beginners

#artificialintelligence

From Google search results to Netflix recommendations and investment strategies, Bayes Theorem (also often called Bayes Rule or Bayes Formula) is used across countless industries to help calculate and assess probability. Bayesian statistics is taught in most first-year statistics classes across the nation, but there is one major problem that many students (and others who are interested in the theorem) face. The theorem is not intuitive for most people, and understanding how it works can be a challenge, especially because it is often taught without visual aids. In this guide, we unpack the various components of the theorem and provide a basic overview of how it works, with illustrations to help. Three scenarios – the flu, breathalyzer tests, and peacekeeping – are used throughout the booklet to teach how problems involving Bayes Theorem can be approached and solved.
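
As a small worked example in the spirit of the flu scenario, the sketch below applies Bayes Theorem, P(flu | positive) = P(positive | flu) P(flu) / P(positive), to a diagnostic test. The numbers are invented for illustration and do not come from the booklet: a 5% prior chance of flu, a test that detects 90% of true cases, and a 10% false positive rate.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes Theorem.

    prior               -- P(condition) before seeing the test result
    sensitivity         -- P(positive | condition)
    false_positive_rate -- P(positive | no condition)
    """
    # Total probability of a positive result, over both possibilities.
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence


# Hypothetical numbers: 5% of patients have the flu, the test catches 90%
# of true cases, and it wrongly flags 10% of healthy patients.
p_flu_given_positive = posterior(prior=0.05, sensitivity=0.9, false_positive_rate=0.1)
print(f"{p_flu_given_positive:.3f}")
```

Under these assumed numbers, a positive result raises the probability of flu from 5% to roughly 32%, well short of certainty, because healthy patients greatly outnumber sick ones. This is the kind of counterintuitive result the guide sets out to explain.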