How to Explore a Dataset of Images with Graph Theory


When you start working on a dataset of pictures, you'll probably be asked questions like: can you check whether the pictures are good? A quick-and-dirty solution is to look at the images one by one and sort them out manually, but that can be tedious work depending on how many pictures you get. For example, in manufacturing, you could receive thousands of pictures from a production line of batteries of different types and sizes, and you'd have to go through all of them and arrange them by type, size, or even color. The more efficient option is to go the computer vision route and find an algorithm that can automatically arrange and sort your images -- this is the goal of this article. But how can we automate what a person does, i.e. compare pictures two by two with one another and sort them based on similarities?
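A minimal sketch of this idea, assuming each image has already been reduced to a numeric feature vector (the `group_by_similarity` function and the 0.9 cosine-similarity threshold are illustrative choices, not the article's actual pipeline): treat images as graph nodes, connect pairs whose similarity exceeds the threshold, and read off the connected components as groups of similar images.

```python
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def group_by_similarity(features, threshold=0.9):
    """Build a graph whose nodes are images and whose edges connect
    pairs with similarity above `threshold`, then return its connected
    components as groups of similar images."""
    n = len(features)
    adjacency = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if cosine_similarity(features[i], features[j]) >= threshold:
            adjacency[i].add(j)
            adjacency[j].add(i)
    # connected components via depth-first search
    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, component = [start], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.append(node)
            stack.extend(adjacency[node] - seen)
        groups.append(sorted(component))
    return groups
```

With real images, the feature vectors would come from a color histogram, a perceptual hash, or a pretrained network embedding; the graph step is unchanged.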

Text Similarity w/ Levenshtein Distance in Python


In this article I will go over the intuition behind how Levenshtein distance works and how to use it to build a plagiarism detection pipeline. Identifying similarity between texts is a common problem in NLP and is used by many companies worldwide. The most common application of text similarity is identifying plagiarized text. Educational institutions ranging from elementary schools and high schools to colleges and universities around the world use services like Turnitin to ensure that the work submitted by students is original and their own. Text similarity is also commonly used by companies with a structure similar to Stack Overflow or Stack Exchange.
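For reference, here is the classic dynamic-programming implementation of Levenshtein distance, a sketch rather than the article's pipeline code:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: the minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))  # distances from a[:0] to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                  # distance from a[:i] to the empty prefix
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution (or match)
        prev = curr
    return prev[-1]
```

For plagiarism detection, the raw distance is typically normalized into a similarity score, e.g. `1 - levenshtein(a, b) / max(len(a), len(b))`, so that identical texts score 1 and unrelated texts score near 0.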


AAAI Conferences

In Description Logic (DL) knowledge bases (KBs), information is typically captured by crisp concepts. For many applications, querying the KB with crisp query concepts is too restrictive. A controlled way of gradually relaxing a query concept can be achieved through concept similarity measures. In this paper we formalize the task of instance query answering for crisp DL KBs using concepts relaxed by concept similarity measures. We investigate computation algorithms for this task in the DL EL, their complexity, and the properties required of the employed similarity measure, depending on whether unfoldable or general TBoxes are used.


AAAI Conferences

We present a case-based approach to character identification in natural language text in the context of our Voz system. Voz first extracts entities from the text and, for each of them, computes a feature vector using both linguistic information and external knowledge. We propose a new similarity measure called Continuous Jaccard that exploits those feature vectors to compute the similarity between a given entity and those in the case base, and thus determine which entities are characters. We evaluate our approach by comparing it with different similarity measures and feature sets. Results show an identification accuracy of up to 93.49%, significantly higher than recent related work.
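The abstract does not reproduce the paper's exact definition of Continuous Jaccard. A common way to extend Jaccard similarity from sets to non-negative real-valued feature vectors is the weighted (Ruzicka) form sketched below, offered only as an illustration of the general idea:

```python
def weighted_jaccard(u, v):
    """Jaccard similarity generalized to non-negative real-valued vectors
    (the Ruzicka similarity): sum(min(u_i, v_i)) / sum(max(u_i, v_i)).
    On 0/1 indicator vectors it reduces to the ordinary Jaccard index."""
    num = sum(min(a, b) for a, b in zip(u, v))
    den = sum(max(a, b) for a, b in zip(u, v))
    return num / den if den else 1.0  # two all-zero vectors count as identical
```

For example, `weighted_jaccard([1, 2, 0], [2, 2, 1])` gives `3/5 = 0.6`.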


AAAI Conferences

The field of argumentation investigates the vision of robust argumentation machines: systems that explore natural language arguments from available information sources on the web and reason with them on the knowledge level to actively support the deliberation and synthesis of arguments for a particular user query. We aim to combine methods from case-based reasoning (CBR), information retrieval, and computational argumentation to contribute to the foundations of such argumentation machines. In this paper, we focus on the retrieval phase of a CBR approach for an argumentation machine and propose similarity measures for arguments represented as argument graphs. We evaluate the similarity measures on a corpus of annotated microtexts covering different topics and demonstrate the benefit of semantic similarity measures as well as the relevance of structural aspects.


AAAI Conferences

During the early stages of developing Case-Based Reasoning (CBR) systems, the definition of similarity measures is challenging, since this task requires transferring implicit knowledge of domain experts into knowledge representations. While an entire CBR system is very explanatory, the similarity measure determines the ranking but does not necessarily show which features contribute to high (or low) rankings. In this paper we present our work on opening up the knowledge engineering process for similarity modelling. We show how we transfer implicit knowledge from experts, as well as from a data-driven approach, into case and similarity representations for CBR systems. The work presented is a result of interdisciplinary research collaboration between AI and medical researchers developing e-Health applications.


AAAI Conferences

In this paper, we briefly address research on how to objectively evaluate machine-based object similarity measures against human-based estimation. Based on a novel approach to similarity measurement for 3-D objects, we create a ground truth of 3-D objects and their similarities as estimated by humans. The automatic similarity results are evaluated against this ground truth in terms of precision and recall in an object retrieval scenario. To further illustrate the reciprocity between machine and human perception, we compare the similarities achieved by both on test data and show how this can be used to address other problems and formulations.

A new similarity measure for covariate shift with applications to nonparametric regression Machine Learning

In the standard formulation of prediction or classification, future data (as represented by a test set) is assumed to be drawn from the same distribution as the training data. This assumption, while theoretically convenient, may fail to hold in many real-world scenarios. For instance, training data might be collected only from a sub-group within a broader population (such as in medical trials), or the environment might change over time as data are collected. Such scenarios result in a distribution mismatch between the training and test data. In this paper, we study an important case of such distribution mismatch--namely, the covariate shift problem (e.g., [21, 19]). Suppose that a statistician observes covariate-response pairs (X, Y), and wishes to build a prediction rule. In the problem of covariate shift, the distribution of the covariates X is allowed to change between the training and test data, while the posterior distribution of the responses (namely, Y | X) remains fixed. Compared to the usual i.i.d.
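The covariate-shift setup described here can be illustrated with importance weighting, a standard correction from the covariate-shift literature (the functions below are illustrative, not the paper's proposed similarity measure): since p(y|x) is fixed while p(x) changes, each training point is reweighted by the density ratio w(x) = p_test(x) / p_train(x).

```python
def importance_weights(x_train, p_test, p_train):
    """Under covariate shift, p(y|x) is fixed while p(x) changes between
    train and test. The standard correction weights each training point
    by the density ratio w(x) = p_test(x) / p_train(x)."""
    return [p_test(x) / p_train(x) for x in x_train]

def weighted_mean_response(ys, weights):
    """Importance-weighted (self-normalized) estimate of E_test[Y]
    computed from training responses alone."""
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)
```

As a toy check: if X is binary with p_train(X=1) = 0.2 but p_test(X=1) = 0.8, and Y = X, a training sample with four 0s and one 1 has a naive mean response of 0.2, while the importance-weighted estimate recovers the test-time mean of 0.8.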

Deconfounded Representation Similarity for Comparison of Neural Networks Machine Learning

Similarity metrics such as representational similarity analysis (RSA) and centered kernel alignment (CKA) have been used to compare layer-wise representations between neural networks. However, these metrics are confounded by the population structure of data items in the input space, leading to spuriously high similarity even for completely random neural networks and to inconsistent domain relations in transfer learning. We introduce a simple and generally applicable fix that adjusts for the confounder with covariate adjustment regression, while retaining the intuitive invariance properties of the original similarity measures. We show that deconfounding the similarity metrics increases the resolution of detecting semantically similar neural networks. Moreover, in real-world applications, deconfounding improves the consistency of representation similarities with domain similarities in transfer learning, and increases correlation with out-of-distribution accuracy.
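For reference, a dependency-free sketch of the plain (unadjusted) linear CKA that this line of work builds on; the deconfounding regression itself is not shown, and the matrix layout (rows = data items, columns = features) is an assumption of this sketch:

```python
def _center_columns(X):
    """Subtract the per-column mean so each feature has zero mean."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    return [[x - m for x, m in zip(row, means)] for row in X]

def linear_cka(X, Y):
    """Linear CKA between representation matrices X (n x p) and Y (n x q)
    over the same n data items:
        ||Xc^T Yc||_F^2 / (||Xc^T Xc||_F * ||Yc^T Yc||_F)."""
    Xc, Yc = _center_columns(X), _center_columns(Y)

    def cross_fro_sq(A, B):  # ||A^T B||_F^2
        total = 0.0
        for i in range(len(A[0])):
            for j in range(len(B[0])):
                v = sum(A[k][i] * B[k][j] for k in range(len(A)))
                total += v * v
        return total

    return cross_fro_sq(Xc, Yc) / (
        cross_fro_sq(Xc, Xc) ** 0.5 * cross_fro_sq(Yc, Yc) ** 0.5)
```

Note the invariance the abstract alludes to: CKA is unchanged by isotropic scaling of either representation, e.g. comparing X against 2X still yields 1.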

Recursive Binding for Similarity-Preserving Hypervector Representations of Sequences Artificial Intelligence

Hyperdimensional computing (HDC), also known as vector symbolic architectures (VSA), is a computing framework used within artificial intelligence and cognitive computing that operates with distributed vector representations of large fixed dimensionality. A critical step in designing HDC/VSA solutions is to obtain such representations from the input data. Here, we focus on sequences and propose a transformation to distributed representations that both preserves the similarity of identical sequence elements at nearby positions and is equivariant to sequence shift. These properties are enabled by forming representations of sequence positions using recursive binding and superposition operations. The proposed transformation was experimentally investigated with symbolic strings used for modeling human perception of word similarity. The obtained results are on a par with more sophisticated approaches from the literature. The proposed transformation was designed for the HDC/VSA model known as Fourier Holographic Reduced Representations. However, it can be adapted to some other HDC/VSA models.
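A toy sketch of the position-encoding idea in the FHRR model named above, using complex phasor hypervectors: binding is elementwise multiplication, and binding the position seed with itself fractionally many times (raising it to a small power per step) makes nearby positions similar. The specific `eps` scheme and function names are illustrative assumptions, not the paper's exact transformation.

```python
import cmath
import math
import random

def random_phasor(dim, rng):
    """FHRR hypervector: one random unit-magnitude phase per dimension."""
    return [cmath.exp(1j * rng.uniform(-math.pi, math.pi)) for _ in range(dim)]

def bind(u, v):
    """Binding in FHRR is elementwise complex multiplication."""
    return [a * b for a, b in zip(u, v)]

def power(u, k):
    """Bind u with itself k times (k may be fractional): u^k per dimension."""
    return [a ** k for a in u]

def similarity(u, v):
    """Real part of the normalized inner product <u, conj(v)> / dim."""
    return sum((a * b.conjugate()).real for a, b in zip(u, v)) / len(u)

def encode(seq, item_vecs, pos_seed, eps=0.1):
    """Encode a sequence as a superposition of item hypervectors bound to
    position hypervectors pos_k = pos_seed^(eps * k); a small eps keeps
    nearby positions similar, so shared symbols at close positions match."""
    acc = [0j] * len(pos_seed)
    for k, sym in enumerate(seq):
        for i, z in enumerate(bind(item_vecs[sym], power(pos_seed, eps * k))):
            acc[i] += z
    return acc
```

With this scheme, the similarity between position vectors decays smoothly with positional distance, which is the property the paper exploits for modeling human word-similarity judgments.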