We propose a hierarchical correlation clustering method that extends the well-known correlation clustering to produce hierarchical clusters. We then investigate how the resulting hierarchy can be used for (tree-preserving) embedding and feature extraction. We study the connection of such an embedding to single linkage embedding and minimax distances, and in particular study minimax distances for correlation clustering. Finally, we demonstrate the performance of our methods on several UCI and 20 Newsgroups datasets.

We analyze the general behavior of agglomerative clustering methods and argue that their strategy establishes a new reliable linkage at each step. However, to provide adaptive, density-consistent, and flexible solutions, we propose extracting all the reliable linkages at each step instead of only the smallest one. This leads to a new agglomerative clustering strategy, called reliable agglomerative clustering, which, like the standard agglomerative variant, can be applied with all common criteria. Moreover, we prove that this strategy with the \emph{single} linkage criterion yields a minimum spanning tree algorithm. We perform experiments on several real-world datasets to demonstrate the superior performance of this strategy compared to the standard alternative.
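As background for the single linkage claim, the sketch below illustrates the well-known relation between standard single linkage and minimum spanning trees: the merge heights of single linkage coincide with the edge weights that Kruskal's algorithm selects. It does not implement the reliable agglomerative strategy proposed here; the function names and the toy points are illustrative assumptions.

```python
from itertools import combinations

def kruskal_mst_weights(dist):
    """Edge weights of a minimum spanning tree, in the order Kruskal picks them."""
    n = len(dist)
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    weights = []
    for w, i, j in sorted((dist[i][j], i, j) for i, j in combinations(range(n), 2)):
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two components, so it enters the tree
            parent[ri] = rj
            weights.append(w)
    return weights

def single_linkage_heights(dist):
    """Merge heights of naive (standard) single linkage clustering."""
    clusters = [[i] for i in range(len(dist))]
    heights = []
    while len(clusters) > 1:
        # Closest pair of clusters under the single linkage criterion.
        h, a, b = min(
            (min(dist[u][v] for u in clusters[a] for v in clusters[b]), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        heights.append(h)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return heights

# Toy one-dimensional points (hypothetical data).
points = [0.0, 1.0, 2.5, 7.0, 7.5, 12.0]
dist = [[abs(p - q) for q in points] for p in points]
mst_weights = kruskal_mst_weights(dist)
sl_heights = single_linkage_heights(dist)
```

On this toy input both lists contain the same values in the same order, which is the sense in which single linkage behaves like a minimum spanning tree construction.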

Chehreghani, Morteza Haghir (Xerox Research Centre Europe)

Minimax distance measures provide an effective way to capture the unknown underlying patterns and classes of the data in a non-parametric way. We develop a general-purpose framework to employ Minimax distances with any classification method that operates on numerical data. For this purpose, we establish a two-step strategy. First, we compute the pairwise Minimax distances between the objects, using the equivalence of Minimax distances over a graph and over a minimum spanning tree constructed on that graph. Then, we embed the pairwise Minimax distances into a new vector space, such that the squared Euclidean distances between the objects in the new space are equal to their Minimax distances in the original space. We also consider the case where multiple pairwise Minimax matrices are given instead of a single one. For this case, we propose an embedding that first sums the centered matrices and then performs an eigenvalue decomposition. We experimentally validate our framework on different synthetic and real-world datasets.
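The first step can be sketched as follows. The Minimax distance between two objects is the smallest achievable value of the largest edge weight along any path connecting them, and it equals the largest edge on the path between them in a minimum spanning tree. The sketch below is a minimal pure-Python illustration that runs Kruskal's algorithm on the complete graph; the function name and the toy data are assumptions, not the authors' implementation.

```python
from itertools import combinations

def minimax_distances(dist):
    """Pairwise Minimax distances for a symmetric dissimilarity matrix."""
    n = len(dist)
    # Kruskal: visit the edges of the complete graph by increasing weight.
    edges = sorted((dist[i][j], i, j) for i, j in combinations(range(n), 2))
    members = {i: [i] for i in range(n)}  # component label -> its objects
    label = list(range(n))                # object -> component label
    mm = [[0.0] * n for _ in range(n)]
    for w, i, j in edges:
        a, b = label[i], label[j]
        if a == b:
            continue  # the edge would close a cycle, so it is not a tree edge
        # w is the largest edge on the tree path between any object of
        # component a and any object of component b.
        for u in members[a]:
            for v in members[b]:
                mm[u][v] = mm[v][u] = w
        for v in members[b]:
            label[v] = a
        members[a].extend(members.pop(b))
    return mm

# Toy one-dimensional points forming two groups (hypothetical data).
points = [0.0, 1.0, 2.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
mm = minimax_distances(dist)
```

Here objects 0 and 2 are 2.0 apart directly, but their Minimax distance is 1.0 because the path through object 1 uses only edges of weight 1.0. Any pair taken from different groups gets the weight of the single bridging edge, 8.0.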

We investigate the use of Minimax distances to extract, in a nonparametric way, the features that capture the unknown underlying patterns and structures in the data. We develop a general-purpose framework to employ Minimax distances with many machine learning methods that operate on numerical data. For this purpose, we first compute the pairwise Minimax distances between the objects, using the equivalence of Minimax distances over a graph and over a minimum spanning tree constructed on that graph. Then, we embed the pairwise Minimax distances into a new vector space, such that the squared Euclidean distances between the objects in the new space are equal to the pairwise Minimax distances in the original space. Next, we study the case of having multiple pairwise Minimax matrices instead of a single one. For this case, we propose an embedding that first sums the centered matrices and then performs an eigenvalue decomposition. Finally, we perform several experimental studies to illustrate the effectiveness of our framework.
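The embedding step can be sketched with NumPy as below. Each matrix is treated as a matrix of squared Euclidean distances and double-centered in the classical multidimensional scaling manner; the centered matrices are summed and the result is eigendecomposed. The exact centering formula, the function name, the tolerance `tol`, and the two small ultrametric matrices are assumptions made for illustration.

```python
import numpy as np

def embed_from_minimax(mats, tol=1e-10):
    """Embed objects so that squared Euclidean distances match the summed matrices."""
    mats = [np.asarray(D, dtype=float) for D in mats]
    n = mats[0].shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    W = sum(-0.5 * J @ D @ J for D in mats)  # sum of the centered matrices
    vals, vecs = np.linalg.eigh(W)           # W is symmetric
    keep = vals > tol                        # drop numerically zero directions
    return vecs[:, keep] * np.sqrt(vals[keep])

# Two hand-made ultrametric matrices over five objects (hypothetical data).
D1 = np.array([[0, 1, 1, 8, 8],
               [1, 0, 1, 8, 8],
               [1, 1, 0, 8, 8],
               [8, 8, 8, 0, 1],
               [8, 8, 8, 1, 0]], dtype=float)
D2 = np.array([[0, 2, 5, 5, 5],
               [2, 0, 5, 5, 5],
               [5, 5, 0, 3, 3],
               [5, 5, 3, 0, 3],
               [5, 5, 3, 3, 0]], dtype=float)

Y = embed_from_minimax([D1, D2])
# Squared Euclidean distances between the embedded objects.
sq = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
```

Passing a single matrix, for example `embed_from_minimax([D1])`, gives the single-matrix case. With several matrices, the squared distances in the embedding correspond to the entrywise sum of the inputs.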

Hierarchical clustering is a class of algorithms that seeks to build a hierarchy of clusters. It has been the dominant approach to constructing embedded classification schemes, since it outputs dendrograms, which capture the hierarchical relationships among members at all levels of granularity simultaneously. Being greedy in the algorithmic sense, hierarchical clustering partitions the data at every step based solely on a similarity or dissimilarity measure. The clustering results often depend not only on the distribution of the underlying data, but also on the choice of dissimilarity measure and clustering algorithm. In this paper, we propose a method to incorporate prior domain knowledge about entity relationships into hierarchical clustering. Specifically, we use a distance function in an ultrametric space to encode the external ontological information. We show that popular linkage-based algorithms can faithfully recover the encoded structure. Similar to some regularized machine learning techniques, we add this distance as a penalty term to the original pairwise distance to regularize the final structure of the dendrogram. As a case study, we apply this method to real data to build a customer-behavior-based product taxonomy for an Amazon service, leveraging information from a larger Amazon-wide browse structure. The method is useful when one wants to leverage relational information from external sources, or when the data used to generate the distance matrix are noisy and sparse. Our work falls in the category of semi-supervised or constrained clustering.
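A minimal sketch of the penalty idea follows. It assumes that the external ontology is given as equal-depth category paths, that the ultrametric distance is the number of levels up to the lowest common ancestor, and that the penalty is combined as D + lam * U. The weight `lam`, the function names, and the toy items are hypothetical and are not taken from the case study.

```python
from itertools import combinations, permutations

def taxonomy_ultrametric(paths):
    """Distance = number of levels below the deepest shared ancestor."""
    n, depth = len(paths), len(paths[0])
    U = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        shared = 0
        for a, b in zip(paths[i], paths[j]):
            if a != b:
                break
            shared += 1
        U[i][j] = U[j][i] = float(depth - shared)
    return U

def is_ultrametric(M):
    """Check the strong triangle inequality d(i,k) <= max(d(i,j), d(j,k))."""
    return all(M[i][k] <= max(M[i][j], M[j][k])
               for i, j, k in permutations(range(len(M)), 3))

def penalized(D, U, lam):
    """Add the ontology distance to the data distance as a weighted penalty."""
    return [[d + lam * u for d, u in zip(rd, ru)] for rd, ru in zip(D, U)]

# Hypothetical items with category paths (last element is the item itself).
paths = [("Electronics", "Audio", "headphones"),
         ("Electronics", "Audio", "speakers"),
         ("Electronics", "Video", "projector"),
         ("Home", "Kitchen", "blender")]
# Hypothetical noisy behavior-based distances between the same items.
D = [[0.0, 0.9, 0.2, 0.3],
     [0.9, 0.0, 0.8, 0.7],
     [0.2, 0.8, 0.0, 0.4],
     [0.3, 0.7, 0.4, 0.0]]

U = taxonomy_ultrametric(paths)
P = penalized(D, U, lam=1.0)
```

In this toy example the raw distances place the headphones closer to the projector and the blender than to the speakers. After adding the penalty with `lam=1.0`, the ordering around the headphones follows the taxonomy. The resulting matrix `P` could then be passed to any linkage-based clustering routine.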