Goto

Collaborating Authors

 minkowski distance




Combined-distance-based score function of cognitive fuzzy sets and its application in lung cancer pain evaluation

Jiang, Lisheng, Zhang, Tianyu, Yan, Shiyu, Fang, Ran

arXiv.org Artificial Intelligence

In decision making, the cognitive fuzzy set (CFS) is a useful tool in expressing experts' complex assessments of alternatives. The distance of CFS, which plays an important role in decision analyses, is necessary when the CFS is applied in solving practical issues. However, as far as we know, the studies on the distance of CFS are few, and the current Minkowski distance of CFS ignores the hesitancy degree of CFS, which might cause errors. To fill the gap of the studies on the distance of CFS, because of the practicality of the Hausdorff distance, this paper proposes the improved cognitive fuzzy Minkowski (CF-IM) distance and the cognitive fuzzy Hausdorff (CF-H) distance to enrich the studies on the distance of CFS. It is found that the anti-perturbation ability of the CF-H distance is stronger than that of the CF-IM distance, but the information utilization of the CF-IM distance is higher than that of the CF-H distance. To balance the anti-perturbation ability and information utilization of the CF-IM distance and CF-H distance, the cognitive fuzzy combined (CF-C) distance is proposed by establishing the linear combination of the CF-IM distance and CF-H distance. Based on the CF-C distance, a combined-distanced-based score function of CFS is proposed to compare CFSs. The proposed score function is employed in lung cancer pain evaluation issues. The sensitivity and comparison analyses demonstrate the reliability and advantages of the proposed methods.


Review for NeurIPS paper: Duality-Induced Regularizer for Tensor Factorization Based Knowledge Graph Completion

Neural Information Processing Systems

Additional Feedback: - Line 70: DB methods are based on Minkowski distance, however, in this paper the duality is stablished only for the case of Frobenius norm, i.e. Minkowski distance with p 2. It would be nice that authors provide a deeper explanation about the role on parameter p in DB methods. What is the optimal value of p in state of the arts methods? The sentence: "the regularizer 5 and 6" should be changed to "the regularizer 4 and 5" (check equation numbering) - Line 180: As a regularizer having several terms, it would be convenient to consider different regularizer coefficients as hyperparameters. In fact, in supplemental material (lines 36-37) the cost function has 3 hyperparameters: lambda, lambda_1 and lambda_2.


Interpreting the Curse of Dimensionality from Distance Concentration and Manifold Effect

Peng, Dehua, Gui, Zhipeng, Wu, Huayi

arXiv.org Artificial Intelligence

The characteristics of data like distribution and heterogeneity, become more complex and counterintuitive as the dimensionality increases. This phenomenon is known as curse of dimensionality, where common patterns and relationships (e.g., internal and boundary pattern) that hold in low-dimensional space may be invalid in higher-dimensional space. It leads to a decreasing performance for the regression, classification or clustering models or algorithms. Curse of dimensionality can be attributed to many causes. In this paper, we first summarize five challenges associated with manipulating high-dimensional data, and explains the potential causes for the failure of regression, classification or clustering tasks. Subsequently, we delve into two major causes of the curse of dimensionality, distance concentration and manifold effect, by performing theoretical and empirical analyses. The results demonstrate that nearest neighbor search (NNS) using three typical distance measurements, Minkowski distance, Chebyshev distance, and cosine distance, becomes meaningless as the dimensionality increases. Meanwhile, the data incorporates more redundant features, and the variance contribution of principal component analysis (PCA) is skewed towards a few dimensions. By interpreting the causes of the curse of dimensionality, we can better understand the limitations of current models and algorithms, and drive to improve the performance of data analysis and machine learning tasks in high-dimensional space.


A Distribution-Based Threshold for Determining Sentence Similarity

Cadamuro, Gioele, Gruppo, Marco

arXiv.org Artificial Intelligence

We hereby present a solution to a semantic textual similarity (STS) problem in which it is necessary to match two sentences containing, as the only distinguishing factor, highly specific information (such as names, addresses, identification codes), and from which we need to derive a definition for when they are similar and when they are not. The solution revolves around the use of a neural network, based on the siamese architecture, to create the distributions of the distances between similar and dissimilar pairs of sentences. The goal of these distributions is to find a discriminating factor, that we call "threshold", which represents a well-defined quantity that can be used to distinguish vector distances of similar pairs from vector distances of dissimilar pairs in new predictions and later analyses. In addition, we developed a way to score the predictions by combining attributes from both the distributions' features and the way the distance function works. Finally, we generalize the results showing that they can be transferred to a wider range of domains by applying the system discussed to a well-known and widely used benchmark dataset for STS problems.


Turtle Score -- Similarity Based Developer Analyzer

Varshini, Sanjjushri, V, Ponshriharini, Kannan, Santhosh, Suresh, Snekha, Ramesh, Harshavardhan, Mahadevan, Rohith, Raman, Raja CSP

arXiv.org Machine Learning

In day-to-day life, a highly demanding task for IT companies is to find the right candidates who fit the companies' culture. This research aims to comprehend, analyze and automatically produce convincing outcomes to find a candidate who perfectly fits right in the company. Data is examined and collected for each employee who works in the IT domain focusing on their performance measure. This is done based on various different categories which bring versatility and a wide view of focus. To this data, learner analysis is done using machine learning algorithms to obtain learner similarity and developer similarity in order to recruit people with identical working patterns. It's been proven that the efficiency and capability of a particular worker go higher when working with a person of a similar personality. Therefore this will serve as a useful tool for recruiters who aim to recruit people with high productivity. This is to say that the model designed will render the best outcome possible with high accuracy and an immaculate recommendation score.


9 Distance Measures in Data Science

#artificialintelligence

Many algorithms, whether supervised or unsupervised, make use of distance measures. These measures, such as euclidean distance or cosine similarity, can often be found in algorithms such as k-NN, UMAP, HDBSCAN, etc. Understanding the field of distance measures is more important than you might realize. Take k-NN for example, a technique often used for supervised learning. As a default, it often uses euclidean distance. However, what if your data is highly dimensional?


9 Distance Measures in Data Science

#artificialintelligence

Many algorithms, whether supervised or unsupervised, make use of distance measures. These measures, such as euclidean distance or cosine similarity, can often be found in algorithms such as k-NN, UMAP, HDBSCAN, etc. Understanding the field of distance measures is more important than you might realize. Take k-NN for example, a technique often used for supervised learning. As a default, it often uses euclidean distance. However, what if your data is highly dimensional?


9 Distance Measures in Data Science

#artificialintelligence

Many algorithms, whether supervised or unsupervised, make use of distance measures. These measures, such as euclidean distance or cosine similarity, can often be found in algorithms such as k-NN, UMAP, HDBSCAN, etc. Understanding the field of distance measures is more important than you might realize. Take k-NN for example, a technique often used for supervised learning. As a default, it often uses euclidean distance. However, what if your data is highly dimensional?