Goto

Collaborating Authors

Fuzzy Hashing as Perturbation-Consistent Adversarial Kernel Embedding

arXiv.org Machine Learning

Measuring the similarity of two files is an important task in malware analysis, with fuzzy hash functions being a popular approach. Traditional fuzzy hash functions are data agnostic: they do not learn from a particular dataset how to determine similarity; their behavior is fixed across all datasets. In this paper, we demonstrate that fuzzy hash functions can be learned in a novel minimax training framework and that these learned fuzzy hash functions outperform traditional fuzzy hash functions at the file similarity task for Portable Executable files. In our approach, hash digests can be extracted from the kernel embeddings of two kernel networks, trained in a minimax framework, where the roles of players during training (i.e adversary versus generator) alternate along with the input data. We refer to this new minimax architecture as perturbation-consistent. The similarity score for a pair of files is the utility of the minimax game in equilibrium. Our experiments show that learned fuzzy hash functions generalize well, capable of determining that two files are similar even when one of those files was generated using insertion and deletion operations.


A kernel-based framework for learning graded relations from data

arXiv.org Machine Learning

Driven by a large number of potential applications in areas like bioinformatics, information retrieval and social network analysis, the problem setting of inferring relations between pairs of data objects has recently been investigated quite intensively in the machine learning community. To this end, current approaches typically consider datasets containing crisp relations, so that standard classification methods can be adopted. However, relations between objects like similarities and preferences are often expressed in a graded manner in real-world applications. A general kernel-based framework for learning relations from data is introduced here. It extends existing approaches because both crisp and graded relations are considered, and it unifies existing approaches because different types of graded relations can be modeled, including symmetric and reciprocal relations. This framework establishes important links between recent developments in fuzzy set theory and machine learning. Its usefulness is demonstrated through various experiments on synthetic and real-world data.


Analogy-Based Preference Learning with Kernels

arXiv.org Machine Learning

Building on a specific formalization of analogical relationships of the form "A relates to B as C relates to D", we establish a connection between two important subfields of artificial intelligence, namely analogical reasoning and kernel-based machine learning. More specifically, we show that so-called analogical proportions are closely connected to kernel functions on pairs of objects. Based on this result, we introduce the analogy kernel, which can be seen as a measure of how strongly four objects are in analogical relationship. As an application, we consider the problem of object ranking in the realm of preference learning, for which we develop a new method based on support vector machines trained with the analogy kernel. Our first experimental results for data sets from different domains (sports, education, tourism, etc.) are promising and suggest that our approach is competitive to state-of-the-art algorithms in terms of predictive accuracy.


Interval type-2 Beta Fuzzy Near set based approach to content based image retrieval

arXiv.org Artificial Intelligence

Abstract-- In an automated search system, similarity is a key concept in solving a human task. Indeed, human process is usually a natural categorization that underlies many natural abilities such as image recovery, language comprehension, decision making, or pattern recognition. In the image search axis, there are several ways to measure the similarity between images in an image database, to a query image. Image search by content is based on the similarity of the visual characteristics of the images. The distance function used to evaluate the similarity between images depends on the criteria of the search but also on the representation of the characteristics of the image; this is the main idea of the near and fuzzy sets approaches. In this article, we introduce a new category of beta type-2 fuzzy sets for the description of image characteristics as well as the near sets approach for image recovery. Finally, we illustrate our work with examples of image recovery problems used in the real world. I. INTRODUCTION He number of daily-generated images by websites and personal archives are constantly growing. Indeed, the effective management of the rapid expansion of visual information has become a major problem and a necessity for strengthening visual search technique based on visual content [3]. This necessity is behind the emergence of new visual search techniques based on visual content. It has been widely identified that the most efficient and intuitive way to research visual information is based on the properties that are extracted from the images themselves. Researchers from different communities ("Computer Vision" [4], "Database Management", "Man-machine Interface", "Information Retrieval") were attracted by this field. Since then, the search for images by content has developed quite rapidly. The intuitive idea of "any system that analyzes or automatically organizes a set of data or knowledge must use, in one form or another, a similarity operator whose purpose is to establish similarities or the relationships that exist between the manipulated information".


Semantic distillation: a method for clustering objects by their contextual specificity

arXiv.org Machine Learning

Techniques for data-mining, latent semantic analysis, contextual search of databases, etc. have long ago been developed by computer scientists working on information retrieval (IR). Experimental scientists, from all disciplines, having to analyse large collections of raw experimental data (astronomical, physical, biological, etc.) have developed powerful methods for their statistical analysis and for clustering, categorising, and classifying objects. Finally, physicists have developed a theory of quantum measurement, unifying the logical, algebraic, and probabilistic aspects of queries into a single formalism. The purpose of this paper is twofold: first to show that when formulated at an abstract level, problems from IR, from statistical data analysis, and from physical measurement theories are very similar and hence can profitably be cross-fertilised, and, secondly, to propose a novel method of fuzzy hierarchical clustering, termed \textit{semantic distillation} -- strongly inspired from the theory of quantum measurement --, we developed to analyse raw data coming from various types of experiments on DNA arrays. We illustrate the method by analysing DNA arrays experiments and clustering the genes of the array according to their specificity.