Finding the most similar textual documents using Case-Based Reasoning

Nov-1-2019–arXiv.org Machine Learning

In recent years, the huge amount of unstructured textual data on the Internet has made it difficult for AI algorithms to provide the best recommendations for users and their search queries. Since the Internet became widespread, a lot of research has been done in the fields of Natural Language Processing (NLP) and machine learning. Almost every solution transforms documents into Vector Space Models (VSM) in order to apply AI algorithms to them. One such approach is based on Case-Based Reasoning (CBR). The most important part of such systems is therefore computing the similarity between numerical data points. In 2016, a new similarity metric, TS-SS, was proposed, which showed state-of-the-art results in text mining for unsupervised learning. However, its performance for supervised learning (classification tasks) had not been investigated before. In this work, we devised a CBR system capable of finding the most similar documents for a given query, aiming to investigate the performance of the new state-of-the-art metric, TS-SS, in addition to two other geometrical similarity measures, Euclidean distance and Cosine similarity, that showed the best predictive results over several benchmark corpora. The results show that the TS-SS measure is surprisingly inappropriate for high-dimensional features.
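
To make the retrieval step concrete, the following is a minimal sketch of nearest-case lookup under the three measures named in the abstract. It is not the authors' implementation. The function names, the toy vectors, and the `retrieve` helper are illustrative assumptions, and the TS-SS formula follows the commonly cited 2016 definition (triangle area times sector area, with 10 degrees added to the angle) as we understand it, so it should be checked against the original TS-SS paper before being relied on.

```python
import math

def norm(a):
    # Euclidean length of a vector.
    return math.sqrt(sum(x * x for x in a))

def euclidean(a, b):
    # Straight-line distance; smaller means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    # Cosine of the angle between a and b; larger means more similar.
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

def ts_ss(a, b):
    # ASSUMPTION: TS-SS = triangle area * sector area, as commonly cited.
    # Smaller means more similar.
    cos = max(-1.0, min(1.0, cosine(a, b)))  # guard against rounding
    theta_deg = math.degrees(math.acos(cos)) + 10.0
    triangle = norm(a) * norm(b) * math.sin(math.radians(theta_deg)) / 2.0
    magnitude_diff = abs(norm(a) - norm(b))
    sector = math.pi * (euclidean(a, b) + magnitude_diff) ** 2 * theta_deg / 360.0
    return triangle * sector

def retrieve(query, cases, k=1, measure=euclidean, larger_is_closer=False):
    # Hypothetical CBR "retrieve" step: rank stored (vector, label) cases
    # by the chosen measure and return the k closest ones.
    ranked = sorted(cases, key=lambda c: measure(query, c[0]),
                    reverse=larger_is_closer)
    return ranked[:k]

# Toy case base; vectors stand in for document feature vectors.
cases = [([1.0, 0.0], "a"), ([0.0, 1.0], "b"), ([2.0, 0.1], "c")]
query = [1.0, 0.05]

nearest_euclidean = retrieve(query, cases, measure=euclidean)[0][1]
nearest_cosine = retrieve(query, cases, measure=cosine, larger_is_closer=True)[0][1]
nearest_ts_ss = retrieve(query, cases, measure=ts_ss)[0][1]
```

In this toy case base, case "c" points in the same direction as the query but is twice as long, so Cosine similarity ranks it first, while Euclidean distance and TS-SS, which both account for magnitude, prefer case "a".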

dataset, feature vector, similarity metric

Country:
- North America > United States
  - New Jersey > Mercer County > Princeton (0.04)
- Europe
  - Sweden (0.04)
  - Switzerland > Zürich
    - Zürich (0.14)

Genre:
- Research Report > New Finding (0.34)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning > Case-Based Reasoning (1.00)
  - Natural Language (1.00)
  - Machine Learning > Memory-Based Learning (1.00)
