Finding the most similar textual documents using Case-Based Reasoning
Mihajlovic, Marko, Xiong, Ning
In recent years, the huge amount of unstructured textual data on the Internet has made it difficult for AI algorithms to provide the best recommendations for users and their search queries. Since the Internet became widespread, a lot of research has been done in the fields of Natural Language Processing (NLP) and machine learning. Almost every solution transforms documents into Vector Space Models (VSM) in order to apply AI algorithms over them. One such approach is based on Case-Based Reasoning (CBR). The most important part of such systems is therefore computing the similarity between numerical data points. In 2016, a new similarity metric, TS-SS, was proposed, which showed state-of-the-art results in text mining for unsupervised learning. However, its performance for supervised learning (the classification task) had not been investigated before. In this work, we devised a CBR system capable of finding the most similar documents for a given query, aiming to investigate the performance of the new state-of-the-art metric, TS-SS, alongside two other geometrical similarity measures, Euclidean distance and Cosine similarity, which showed the best predictive results over several benchmark corpora. The results show that the TS-SS measure is surprisingly inappropriate for high-dimensional features.
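The retrieval step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not define TS-SS, so `ts_ss` assumes the commonly cited 2016 formulation: triangle area times sector area, with 10 degrees added to the angle between the vectors. The helper names (`retrieve`, `classify`), the majority vote over the `k` nearest cases, and the toy vectors are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def euclidean(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (both must be non-zero)."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


def cosine_distance(a, b):
    """Turn cosine similarity into a dissimilarity for ranking."""
    return 1.0 - cosine_similarity(a, b)


def ts_ss(a, b):
    """TS-SS dissimilarity (ASSUMED definition, not given in the abstract)."""
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))  # guard rounding error
    theta_deg = math.degrees(math.acos(cos)) + 10.0
    # Triangle area spanned by the two vectors with the widened angle.
    triangle = norm(a) * norm(b) * math.sin(math.radians(theta_deg)) / 2.0
    # Sector area built from Euclidean distance plus magnitude difference.
    magnitude_diff = abs(norm(a) - norm(b))
    sector = math.pi * (euclidean(a, b) + magnitude_diff) ** 2 * theta_deg / 360.0
    return triangle * sector


def retrieve(query, cases, dissimilarity, k=3):
    """Return the k stored (vector, label) cases closest to the query."""
    return sorted(cases, key=lambda case: dissimilarity(query, case[0]))[:k]


def classify(query, cases, dissimilarity, k=3):
    """Reuse step: majority label among the k retrieved cases."""
    votes = Counter(label for _, label in retrieve(query, cases, dissimilarity, k))
    return votes.most_common(1)[0][0]


# Toy case base: 2-dimensional document vectors with class labels.
CASES = [([1.0, 0.0], "a"), ([0.9, 0.1], "a"), ([0.0, 1.0], "b"), ([0.1, 0.9], "b")]

if __name__ == "__main__":
    for measure in (euclidean, cosine_distance, ts_ss):
        print(measure.__name__, classify([1.0, 0.2], CASES, measure))
```

All three measures are used as dissimilarities here, so the retrieval code stays the same and only the `dissimilarity` argument changes between runs.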
Nov-1-2019
- Country:
- North America > United States
- New Jersey > Mercer County > Princeton (0.04)
- Europe
- Sweden (0.04)
- Switzerland > Zürich
- Zürich (0.14)
- Genre:
- Research Report > New Finding (0.34)
- Technology: